[OCPCLOUD-1354] Add Network Interface Type to AWS Machine Provider Spec #1065
Conversation
// it should use the default of its subnet.
// +optional
PublicIP *bool `json:"publicIp,omitempty"`
// NetworkInterfaceType specifies the type of network interface to be used for the primary
do we make use of secondary network interfaces? I don't mind only controlling this one, but I'd like to know.
We only support a single network interface within Machine API across all of the core cloud providers. I'm not sure why it was done this way, as that decision predates my joining Red Hat.
I don't see an enhancement linked here. I think there are some questions to answer before we grow our API surface.
These would all generally be answered in an enhancement. If the installer says it doesn't impact them and the networking team says it doesn't impact them, then I'm OK to proceed without one, provided the machine-api team says it's easy and promises to write an automated test for it that will block payload promotion. We write enhancements to make this coordination easier. While I can appreciate that some changes are more urgent than others, if we urgently deliver half a solution, or a solution that does not formalize a means to ensure it continues to work over time, our future selves will not be pleased.
/hold
Holding to get the ack/nack list from above worked out. Oh, and the API itself looks pretty reasonable.
For reference, this is where/how this will be used within Machine API: openshift/machine-api-provider-aws#8. We will also add webhook validation for the values of the enum once these first two PRs are merged.
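To illustrate the kind of enum validation mentioned above, here is a minimal Go sketch. The package and function names are hypothetical, and the accepted values assume the constants stay aligned with the EC2 types discussed later in this thread; this is not code taken from the linked PRs.

```go
package webhooks

import "fmt"

// validateNetworkInterfaceType is an illustrative check of the kind the
// follow-up webhook could perform: the field may be empty (use the subnet
// default) or one of the supported enum values.
func validateNetworkInterfaceType(value string) error {
	switch value {
	case "", "interface", "efa":
		return nil
	default:
		return fmt.Errorf("invalid networkInterfaceType %q, expected %q or %q", value, "interface", "efa")
	}
}
```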
elmiko
left a comment
+1, generally looks good to me
Thanks for the feedback David, some notes in response to the questions:
This will primarily be a Machine API feature, in which case I and others from the team have already reviewed it.
I have tested the feature and have found no technical reason why it would break masters. You are correct that masters are created on day 1 by the installer, so if we want to support this feature on masters, support will need to be added to the installer. As far as I have been told, this feature request from the customer was initially for day 2: it is possible to reconfigure existing VMs to use the feature, but they want autoscaling via MAPI to avoid toil once the cluster is bootstrapped. As for the installer, I know they are very busy right now, but if they were to support this, the prerequisites are that both the Go SDK and Terraform support the feature. Based on the docs, the Terraform provider has support and we should be able to include the option (also the code for the specific version). Likewise, the SDK docs for the version used by the installer also include the option. I believe the only thing prohibiting the installer here would be a lack of time.
Just to clarify, you would like to see an E2E blocking job that runs all machines with this option enabled to make sure all of our tests pass as normal?
@deads2k we've been discussing this with @tkatarki, and support for masters is not currently a requirement.
Right, having masters also use EFA instances and EFA interfaces is not an ask for the MVP. @deads2k and @romfreiman
@danwinship does anyone have experience with EFA? Any thoughts on how the SDN could be affected?
I don't know anything about it, but from the docs it seems like EFAs offer a superset of the functionality of ordinary network adapters, so just switching the adapter type shouldn't break anything, hurt debuggability, or anything else.

The docs say "An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself." We don't currently have that, so something is going to have to make that happen when EFAs are enabled. AFAIK the security groups are currently created by the installer in an undocumented manner which is not guaranteed to remain unchanged in the future, so stuff will have to be figured out here. (And of course, changing the security groups in this way means we'd be dropping all firewalling between worker nodes when using EFAs, meaning there will be less total network security, but maybe people who want to use EFAs don't care about that.)

It also seems like actually using the extra EFA functionality might require extra kernel modules, sysctl changes, etc.? I don't see anything in the RFEs about making sure that the HPC functionality is actually usable with OCP when we enable EFAs, and if that hasn't happened, we should probably do that before adding APIs to enable them.
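To make the security group requirement concrete, here is a rough aws-sdk-go sketch of the kind of self-referencing ingress rule the EFA docs describe (an equivalent egress rule would also be needed). The group ID and region are hypothetical, and this is not how the installer actually manages its security groups.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := ec2.New(sess)

	// Hypothetical worker security group ID, for illustration only.
	groupID := "sg-0123456789abcdef0"

	// Allow all inbound traffic from the security group to itself, as the EFA
	// documentation requires. A matching AuthorizeSecurityGroupEgress call
	// would cover the outbound direction.
	_, err := svc.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
		GroupId: aws.String(groupID),
		IpPermissions: []*ec2.IpPermission{{
			IpProtocol: aws.String("-1"), // all protocols and ports
			UserIdGroupPairs: []*ec2.UserIdGroupPair{{
				GroupId: aws.String(groupID),
			}},
		}},
	})
	if err != nil {
		log.Fatalf("failed to add self-referencing ingress rule: %v", err)
	}
	log.Println("self-referencing ingress rule added")
}
```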
Just wanted to add a note, as I'm being poked, that we have pressure from PM (@tkatarki knows more) to try to get this feature merged ahead of feature freeze, even without demonstrating that it can be used end to end. The suggestion was to merge this and the feature implementation under regression testing, rather than waiting for the full end-to-end flow to be demonstrated. I would assume that if we don't get it working by code freeze, we would revert the feature.
There is work happening at the moment to get the full end-to-end feature working with the extra required software deployed via DaemonSets; @kwozyman is giving regular updates on Slack for these. He has also verified that this API and the associated machine patch work as expected, which allowed him to continue his development of the full end-to-end feature.
I'm generally willing to merge APIs without backing implementations (we do this often), but we do it after an enhancement has described how the feature itself impacts other teams (installer, machine-api, and network all appear to come together here). Without knowing how the feature impacts other teams, without knowing the scope of additional changes like kernel modules and sysctl changes, and without seeing how this can be made consistent across workers and masters (even as day 2), I don't think merging now and working it out later is the right path. The comments in this PR suggest to me that other teams have a stake in how this rolls out, and they need a concrete design (openshift/enhancements) to read and comment against. That should happen before we merge an API and hope we can make a viable implementation without a design.
/hold
/hold cancel
The enhancement has been approved and merged. I think we can carry on with this and prepare it for merging after the branch splits at the end of the week.
alexander-demicev
left a comment
/lgtm
/lgtm
Just to preempt a possible question about how this interacts with placement groups: IMO they should be considered separately. Each of the features can be used on its own, and there are no cross dependencies between the API types or the PRs implementing them. The only reason they are linked at the moment is that we have a customer looking to get the absolute maximum performance from EFA networking, for which latency can be dropped from approximately 70-80µs to around 20µs by using a cluster placement group together with the EFA interface.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: alexander-demichev, deads2k, JoelSpeed, lobziik
/label docs-approved
We are not a no-FF team, so I'm adding these labels manually.
@JoelSpeed: all tests passed!
This PR adds the API to allow users to choose between standard network interfaces and AWS Elastic Fabric Adapter interfaces for the EC2 instance created by the Machine.
This is related to https://issues.redhat.com/browse/RFE-2236
More details on EFA can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html
According to the AWS API reference docs, we should just need to set the interface type on the EC2 network interface request to a value of interface or efa. For now, I've left the PR keeping the two constants matching the EC2 types, though if we think the names could be improved, I'm open to suggestions.
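For illustration, here is a minimal sketch of what the provider spec addition could look like, keeping the constants aligned with the EC2 values of interface and efa as described above. The package, type, constant, and excerpt struct names are assumptions for this sketch, not necessarily the exact names that were merged.

```go
package v1beta1

// AWSNetworkInterfaceType describes the interface type used for the Machine's
// primary network interface. The constant values mirror the EC2 API values.
type AWSNetworkInterfaceType string

const (
	// AWSStandardNetworkInterfaceType is the default EC2 interface type ("interface").
	AWSStandardNetworkInterfaceType AWSNetworkInterfaceType = "interface"
	// AWSEFANetworkInterfaceType selects an Elastic Fabric Adapter ("efa").
	AWSEFANetworkInterfaceType AWSNetworkInterfaceType = "efa"
)

// awsMachineProviderConfigExcerpt shows how the new field would sit next to
// the existing PublicIP field from the diff above (excerpt for illustration).
type awsMachineProviderConfigExcerpt struct {
	// PublicIP is the existing field, unchanged by this PR.
	PublicIP *bool `json:"publicIp,omitempty"`

	// NetworkInterfaceType specifies the type of network interface to be used for the
	// primary network interface. Valid values are "interface" (the default) and "efa".
	// +optional
	NetworkInterfaceType AWSNetworkInterfaceType `json:"networkInterfaceType,omitempty"`
}
```

On the implementation side (openshift/machine-api-provider-aws#8), the value would presumably be passed straight through to the InterfaceType field of the ec2.InstanceNetworkInterfaceSpecification built for the RunInstances request.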