Conversation

@JoelSpeed
Contributor

This PR adds the API to allow users to choose between standard network interfaces and AWS Elastic Fabric Adapter interfaces for the EC2 instance created by the Machine.

This is related to https://issues.redhat.com/browse/RFE-2236

More details on EFA can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html

According to the AWS API reference docs, we should just need to set the interface type on the EC2 network interface request to either interface or efa. For now, I've kept the PR's two constants matching the EC2 types, though if we think the names could be improved, I'm open to suggestions.
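
For illustration, a minimal sketch of what the type and constants might look like, assuming the constant values mirror the EC2 types (names here are illustrative, not necessarily the merged API):

// AWSNetworkInterfaceType describes the type of network interface attached to the instance.
type AWSNetworkInterfaceType string

const (
	// NetworkInterfaceTypeInterface is a standard EC2 network interface ("interface").
	NetworkInterfaceTypeInterface AWSNetworkInterfaceType = "interface"
	// NetworkInterfaceTypeEFA is an Elastic Fabric Adapter ("efa") interface.
	NetworkInterfaceTypeEFA AWSNetworkInterfaceType = "efa"
)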

// it should use the default of its subnet.
// +optional
PublicIP *bool `json:"publicIp,omitempty"`
// NetworkInterfaceType specifies the type of network interface to be used for the primary
Contributor

Do we make use of secondary network interfaces? I don't mind only controlling this one, but I'd like to know.

Contributor Author

We only support a single network interface within Machine API across all of the core cloud providers. I'm not sure why this was done, as it predates me joining RH.

@JoelSpeed JoelSpeed force-pushed the aws-network-interface branch from cbad8a9 to bbf96a7 Compare November 23, 2021 14:49
@deads2k
Contributor

deads2k commented Nov 23, 2021

I don't see an enhancement linked here. I think there are some questions to answer before we grow our API surface.

  1. Does the regularly consuming team know this is coming? I think this is mostly the machine-api team. elmiko ACK - [OCPCLOUD-1354] Add Network Interface Type to AWS Machine Provider Spec #1065 (review)
  2. Does this work for masters? My understanding is that masters are created by the installer, not the machine-api. If this is the case, then the installer needs to review this API and determine how they will implement it. We need an installer ACK before this can merge.
  3. From the networking team, I'd like to understand whether this new mode causes any issues with debugging or constrains their future choices.
  4. Does the QE team know that this is adding to the list of configurations to test?

These would all generally be answered in an enhancement. If the installer says it doesn't impact them and the networking team says it doesn't impact them, then I'm OK to proceed without one, provided the machine-api team says it's easy and promises to write an automated test for it that will block payload promotion. We write enhancements to make this coordination easier.

While I can appreciate that some changes are more urgent than others, if we urgently deliver half a solution, or a solution that does not formalize a means of ensuring it continues to work over time, our future selves will not be pleased.

/hold

holding to get the ack/nack list from above worked out.

oh, and the API itself looks pretty reasonable.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2021
@JoelSpeed
Contributor Author

For reference, this is where/how this will be used within Machine API: openshift/machine-api-provider-aws#8

We will also add webhook validation for the values of the enum once these first two PRs are merged
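
As a rough sketch of what that enum check could look like (hypothetical function name, reusing the constant names sketched above; the real validation will live in the machine-api webhooks):

import "fmt"

// validateNetworkInterfaceType sketches the planned enum validation.
func validateNetworkInterfaceType(t AWSNetworkInterfaceType) error {
	switch t {
	case "", NetworkInterfaceTypeInterface, NetworkInterfaceTypeEFA:
		// Empty means no opinion; the platform default is used.
		return nil
	default:
		return fmt.Errorf("invalid networkInterfaceType %q, valid values are %q and %q",
			t, NetworkInterfaceTypeInterface, NetworkInterfaceTypeEFA)
	}
}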

Contributor

@elmiko elmiko left a comment

+1, generally looks good to me

@JoelSpeed
Contributor Author

Thanks for the feedback, David. Some notes in response to the questions:

Does the regularly consuming team know this is coming? I think this is mostly the machine-api team.

This will primarily be a Machine API feature, and I and others from the team have already reviewed it.

Does this work for masters? My understanding is that masters are created by the installer, not the machine-api. If this is the case, then the installer needs to review this API and determine how they will implement it. We need an installer ACK before this can merge.

I have tested the feature and have found no technical reason why it would break masters. You are correct that masters are created day 1 by the installer, and as such, if we want support for this feature on masters, it will need to be added to the installer.

As far as I have been told, this feature request from the customer was for day 2 initially, as it is possible to reconfigure existing VMs to use the feature, but they want autoscaling via MAPI to avoid toil once the cluster is bootstrapped.

For the installer's sake, I know they are very busy right now, but if they were to support this, the prerequisite is that the Go SDK and Terraform both support the feature.

Based on the docs, the Terraform provider has support, so we should be able to include the option (see also the code for the specific version).

Likewise, the SDK docs for the version used by the installer also include the option.
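
To make the mapping concrete, a hedged sketch of how the option surfaces in the EC2 request via aws-sdk-go v1 (subnet ID and instance sizing are placeholders, not values from this PR):

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// efaRunInstancesInput shows InterfaceType set on the primary network interface.
func efaRunInstancesInput() *ec2.RunInstancesInput {
	return &ec2.RunInstancesInput{
		MinCount: aws.Int64(1),
		MaxCount: aws.Int64(1),
		NetworkInterfaces: []*ec2.InstanceNetworkInterfaceSpecification{{
			DeviceIndex:   aws.Int64(0),
			SubnetId:      aws.String("subnet-0123456789abcdef0"), // placeholder
			InterfaceType: aws.String("efa"), // "interface" for a standard NIC
		}},
	}
}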

I believe the only thing prohibiting the installer here would be a lack of time.

promises to write an automated test for it that will block payload promotion

Just to clarify, you would like to see an E2E blocking job that runs all machines with this option enabled, to make sure all of our tests pass as normal?

@romfreiman

@deads2k we've been discussing this with @tkatarki, and having masters support this is not a requirement currently.
Bullet 3 is interesting - I assume that we should try running networking-related tests on those interfaces and see how it goes.
And, of course, QE has to validate it before releasing, but this is true of every feature we develop. I don't think it's different from supporting various instance types - do we have coverage for each and every instance type in our CI?

@tkatarki

Right, having masters also use EFA instances and EFA interfaces is not an ask for the MVP. @deads2k @romfreiman

@romfreiman

@danwinship does anyone have experience with EFA? Any thoughts on how the SDN could be affected?

@danwinship
Contributor

@danwinship does anyone have experience with EFA? Any thoughts on how the SDN could be affected?

I don't know anything about it, but from the docs it seems like EFAs offer a superset of the functionality of ordinary network adapters, so just switching the adapter type shouldn't break anything, hurt debuggability, or anything else.

The docs say "An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself." We don't currently have that, so something is going to have to make that happen when EFAs are enabled. AFAIK currently the security groups are created by the installer in an undocumented manner which is not guaranteed to remain unchanged in the future, so stuff will have to be figured out here. (And also of course, changing the security groups in this way means we'd be dropping all firewalling between worker nodes when using EFAs, meaning there will be less total network security, but maybe people who want to use EFAs don't care about that.)

It also seems like actually using the extra EFA functionality might require extra kernel modules, sysctl changes, etc? I don't see anything in the RFEs about making sure that the HPC functionality is actually usable with OCP when we enable EFAs, and if that hasn't happened, we should probably do that before adding APIs to enable them.
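
On the security-group point, a hedged sketch of the self-referencing allow-all rule the EFA docs require, using aws-sdk-go v1 (whoever ends up owning this would presumably do it in the installer or an operator; function and variable names are placeholders):

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// allowAllWithinGroup authorizes all inbound and outbound traffic between
// members of the same security group, as EFA requires.
func allowAllWithinGroup(client *ec2.EC2, groupID string) error {
	perms := []*ec2.IpPermission{{
		IpProtocol:       aws.String("-1"), // all protocols and ports
		UserIdGroupPairs: []*ec2.UserIdGroupPair{{GroupId: aws.String(groupID)}},
	}}
	if _, err := client.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
		GroupId:       aws.String(groupID),
		IpPermissions: perms,
	}); err != nil {
		return err
	}
	_, err := client.AuthorizeSecurityGroupEgress(&ec2.AuthorizeSecurityGroupEgressInput{
		GroupId:       aws.String(groupID),
		IpPermissions: perms,
	})
	return err
}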

@JoelSpeed
Contributor Author

Just wanted to add a note, as I'm being poked, that we have pressure from PM (@tkatarki knows more) to try and get this feature merged ahead of FF even without demonstrating that it can be used end to end. The suggestion was to merge this and the feature implementation under regression testing, rather than waiting for the full end-to-end flow to be demonstrated. I would assume that if we don't get it working by code freeze, we would revert the feature.

It also seems like actually using the extra EFA functionality might require extra kernel modules, sysctl changes, etc? I don't see anything in the RFEs about making sure that the HPC functionality is actually usable with OCP when we enable EFAs, and if that hasn't happened, we should probably do that before adding APIs to enable them.

There is work happening at the moment to get the full end-to-end feature working, with the extra required software deployed via daemonsets; @kwozyman is giving regular updates on Slack for these. He has also verified that this API and the associated machine patch work as expected, which allowed him to continue development of the full end-to-end feature.

@deads2k
Contributor

deads2k commented Dec 14, 2021

get this feature merged ahead of FF even without demonstrating that it can be used end to end.

I'm generally willing to merge APIs without backing implementations (we do this often), but we do it after an enhancement has described how the feature itself impacts other teams (installer, machine-api, and network all appear to be coming together here). Without knowing how the feature impacts other teams, without knowing the scope of additional changes like kernel modules and sysctl changes, and without seeing how this can be made consistent across workers and masters (even as day 2), I don't think merging now and working it out later is the right path.

The comments in this PR suggest to me that other teams have a stake in how this rolls out and they need a concrete design (openshift/enhancements) to read and comment against. That should happen before we merge an API and hope we can make a viable implementation without a design.

/hold

@JoelSpeed JoelSpeed force-pushed the aws-network-interface branch from 75493c2 to f24df3c Compare December 21, 2021 17:21
@JoelSpeed JoelSpeed force-pushed the aws-network-interface branch from f24df3c to 9ec7092 Compare January 25, 2022 12:48
@JoelSpeed
Contributor Author

/hold cancel

The enhancement has been approved and merged; I think we can carry on with this and prepare it for merging after the branch splits at the end of the week.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2022
Contributor

@alexander-demicev alexander-demicev left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2022
@lobziik
Contributor

lobziik commented Jan 25, 2022

/lgtm

@JoelSpeed
Contributor Author

Just to preempt a possible question about how this interacts with placement groups: IMO the two features should be considered separately. Each can be used on its own, and there are no cross-dependencies between the API types or the PRs implementing them.

The only reason these are linked at the moment is that we have a customer looking to get the absolute maximum performance from EFA networking; for that, you can drop latency from approximately 70-80µs to 20µs by using a cluster placement group with the EFA interface.

@deads2k
Contributor

deads2k commented Feb 3, 2022

/approve

@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexander-demichev, deads2k, JoelSpeed, lobziik


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
@JoelSpeed
Copy link
Contributor Author

/label docs-approved
/label px-approved
/label qe-approved

We are not a no-FF team, so I'm adding these labels manually

@openshift-ci openshift-ci bot added docs-approved Signifies that Docs has signed off on this PR px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR labels Feb 4, 2022
@openshift-ci
Contributor

openshift-ci bot commented Feb 4, 2022

@JoelSpeed: all tests passed!

@openshift-merge-robot openshift-merge-robot merged commit c4ca496 into openshift:master Feb 4, 2022
@JoelSpeed JoelSpeed deleted the aws-network-interface branch February 4, 2022 10:26