
required security group ports depend on choice of network plugin #1218

Closed
danwinship opened this issue Feb 8, 2019 · 18 comments

@danwinship (Contributor) commented Feb 8, 2019

Currently data/data/aws/vpc/sg-{master,worker}.tf hardcode the fact that nodes need UDP port 4789 (VXLAN) open between them for the SDN. But with ovn-kubernetes, they will instead need UDP port 6081 (Geneve aka "Generic Network Virtualization Encapsulation") for the pod network, as well as TCP ports 6641 and 6642 for communication between OVN daemons. Other network plugins might use other ports. So there will eventually need to be some sort of abstraction. (Or I guess we could just open all of those ports regardless of network plugin. I'm not sure if we care about supporting third-party plugins on AWS anyway.)
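To make the difference concrete, here is a rough Terraform sketch of the rules in question. The resource names and the `self`-scoped sources are illustrative, not the actual contents of sg-{master,worker}.tf, and real rules would also need cross-group variants (source_security_group_id between the master and worker groups).

```hcl
# Illustrative sketch only; not the installer's actual sg-{master,worker}.tf.

# What is effectively hardcoded today: VXLAN between nodes.
resource "aws_security_group_rule" "worker_ingress_vxlan" {
  type              = "ingress"
  protocol          = "udp"
  from_port         = 4789
  to_port           = 4789
  self              = true
  security_group_id = aws_security_group.worker.id
}

# What ovn-kubernetes would need instead: Geneve between nodes...
resource "aws_security_group_rule" "worker_ingress_geneve" {
  type              = "ingress"
  protocol          = "udp"
  from_port         = 6081
  to_port           = 6081
  self              = true
  security_group_id = aws_security_group.worker.id
}

# ...plus the OVN database ports (6641 northbound, 6642 southbound) on masters.
resource "aws_security_group_rule" "master_ingress_ovn_db" {
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 6641
  to_port           = 6642
  self              = true
  security_group_id = aws_security_group.master.id
}
```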

@squeed (Contributor) commented Feb 21, 2019

I've been doing some thinking about this, and I wonder if the answer is for the network operator to manage the security groups.

We're going to have to do something like this if we ever support migrating between SDN providers anyway. I believe the Kuryr team's approach is similar: they've moved almost all networking setup code out of Terraform and into the network operator.

This does mean that we need to give the network operator AWS credentials, but there are already other operators that do similar things.

Ideally we could abstract this away and create a desired list of open ports, and the machine-api-operator would handle this. However, we have a bootstrapping problem because the network needs to come up before the MAO in our current architecture.

Are there long-term plans to move cloud creation out of the installer?

@danwinship (Contributor, Author)

> This does mean that we need to give the network operator AWS credentials

and knowledge of how to open ports on each supported cloud platform... It seems weird to have cloud-specific knowledge in CNO itself.

@wking (Member) commented Feb 21, 2019

> and knowledge of how to open ports on each supported cloud platform... It seems weird to have cloud-specific knowledge in CNO itself.

You may be able to add a port-opening abstraction under the cluster-API umbrella. But somebody is going to have to know how to do this on all platforms, so I don't know how much we save by shuffling it around within the cluster. Moving it from the installer into the cluster does seem like something you'd need in order to support migrations.

@squeed (Contributor) commented Feb 28, 2019

QE found an (annoyingly obvious) bug: we give users a knob in the network operator to change the VXLAN port, which is critical on certain platforms and deployments. Of course, they can't actually test it, because all UDP ports except 4789 are blocked.

As a quick fix, I'm going to expand the "system services" range to also allow UDP. But we will probably just need to bite the bullet and do this in the operator.
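A minimal sketch of that quick fix, assuming the "system services" range is the internal cluster-communication range; the 9000-9999 bounds and the resource name here are assumptions rather than the installer's actual values:

```hcl
# Sketch only: mirror the existing TCP internal/system-services rule with a UDP
# counterpart so a relocated VXLAN port has something to pass through.
# The 9000-9999 range and the resource name are assumptions, not real values.
resource "aws_security_group_rule" "worker_ingress_internal_udp" {
  type              = "ingress"
  protocol          = "udp"
  from_port         = 9000
  to_port           = 9999
  self              = true
  security_group_id = aws_security_group.worker.id
}
```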

This is looking like a bit of a mess. I'd like there to be a clean handoff point to the operator, which means a single security group for all machines. Right now there are entirely disjoint SGs for workers and masters.

@danwinship (Contributor, Author)

That feature is only needed on VMware though. We could just drop it for 4.0. (We didn't provide any way to open non-standard VXLAN ports on AWS in 3.x either.)

@wking (Member) commented Feb 28, 2019

On Feb 28, 2019 04:54, Casey Callendrello wrote:

> I'd like there to be a clean handoff point to the operator, which means a single security group for all machines. Right now there are entirely disjoint SGs for workers and masters.

Or have the operator create a new networking security group that it associates with all nodes? The installer could keep providing separate compute and control-plane groups for the other ports.

@squeed (Contributor) commented Feb 28, 2019

> Or have the operator create a new networking security group that it associates with all nodes? The installer could keep providing separate compute and control-plane groups for the other ports.

Then the operator would need to know how to associate the security group with all nodes, which strikes me as something we don't want to do.

@danwinship (Contributor, Author)

I think @wking might have meant "have the installer create a new networking security group"?

@wking (Member) commented Feb 28, 2019

> I think @wking might have meant "have the installer create a new networking security group"?

No, but that would work too. If we created it in Terraform (which we run after rendering our manifests), then we wouldn't have the ID to put into a manifest. But we could put a name- or tag-based filter into the manifest; the networking operator could look the group up and then just manage rules on it, while the usual machine-API tooling took care of keeping it associated with new nodes.
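As a sketch of that approach (the names here are hypothetical, not anything the installer actually creates today), the Terraform side could look roughly like this, with the rules themselves left for the networking operator to manage:

```hcl
# Hypothetical sketch: a dedicated networking security group the installer
# creates empty and tags, so a manifest can reference it by name/tag instead of
# by ID. cluster_id and cluster_vpc are illustrative names.
resource "aws_security_group" "networking" {
  name   = "${var.cluster_id}-networking-sg"
  vpc_id = aws_vpc.cluster_vpc.id

  tags = {
    Name = "${var.cluster_id}-networking-sg"
  }
}
```

The networking operator would then look the group up by that tag and manage only its ingress/egress rules, while the machine-API side keeps the group attached to new machines.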

@danwinship (Contributor, Author)

> This does mean that we need to give the network operator AWS credentials, but there are already other operators that do similar things.

It looks like https://github.com/openshift/cloud-credential-operator is what you're supposed to use. But it doesn't start until well after the network is up, so we can't use it to help bring the network up.

I guess the CNO already has cluster-admin, so it can just request the kube-system/aws-creds Secret itself and then use that.

@pecameron

Can we start with kube-system/aws-creds and after the ports are set up, drop them? We don't need them on an ongoing basis.

@wking (Member) commented Apr 10, 2019

> Can we start with kube-system/aws-creds and after the ports are set up, drop them? We don't need them on an ongoing basis.

If you have cluster-admin, can't you always drop them and then re-fetch them whenever you need them (e.g. for subsequent migrations)? In the future, you could try requesting a cred from the cred operator and fall back to grabbing kube-system/aws-creds, but you don't have to start there.

@dcbw (Contributor) commented Apr 24, 2019

@squeed so what's the consensus here? Time is getting short for 4.1.

@danwinship (Contributor, Author)

@wking so actually, we're going to want to distinguish masters and workers in the network plugin rules; e.g., for ovn-kubernetes, we want to open the Geneve port between all nodes, but the ovn-northd port should only be open from masters to masters, and the OVN southbound port should be open from masters-or-workers to masters. So maybe we should just modify the existing master and worker security groups rather than having a separate networking security group?

If so, how can we reliably find those groups? It looks like right now we could fetch the Infrastructure.config.openshift.io object, look at its infrastructureName, and then search for security groups tagged with Key=Name and Value=${infrastructureName}-master-sg or Value=${infrastructureName}-worker-sg, but is that guaranteed to keep working?
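For illustration, the lookup described above expressed as a Terraform data source (the operator would issue the equivalent DescribeSecurityGroups call through the AWS SDK; infrastructure_name here stands in for the infrastructureName field):

```hcl
# Illustration only of the tag-based lookup described above.
data "aws_security_group" "master" {
  filter {
    name   = "tag:Name"
    values = ["${var.infrastructure_name}-master-sg"]
  }
}
```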

@tmjd (Contributor) commented May 24, 2019

The changes needed for network plugins are not restricted to security group rules. At least for Calico, it is desirable to be able to disable source/destination checks on masters and nodes.
For nodes I hope this will be possible by updating the AWSMachineProviderConfig (with configuration options that do not exist yet), but for the masters I am less sure that is an option. Source/destination checks can at least be disabled while an instance is running, but are there cases where that is not possible?

I wanted to bring this up to make sure the full scope of what needs to be managed/updated is considered.

Sorry, I'm not totally clear on how machine configuration is done or handled, so please correct me if it seems I'm misunderstanding something.
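For reference, a sketch of the source/destination-check piece if it were handled at provisioning time; the attribute follows the Terraform AWS provider, and the resource and variable names are hypothetical:

```hcl
# Hypothetical sketch: disable the EC2 source/destination check at provision
# time, which plugins like Calico need when routing pod traffic via node IPs.
resource "aws_instance" "master" {
  ami               = var.master_ami           # hypothetical variables
  instance_type     = var.master_instance_type
  source_dest_check = false
}
```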

@squeed (Contributor) commented May 24, 2019

So, for 4.1 and 4.2, the plan is to stick with static VPC and SG management. There's just not enough time in the 4.2 cycle to change anything up.

For 4.3, the openshift network operator will be managing the security group settings necessary for the overlay network to function. (The installer will still be expected to install the initial basic infrastructure). I'll make sure that the design includes the ability for third-party components to request open ports.

As for the source/destination checks, what would be a reasonable API for this? I do think it might have to be part of the installer, since it's a machine property and needs to be set on the control plane before the CNO comes up.

@abhinavdahiya (Contributor)

So it looks like we have a consensus that the networking operators will be doing this.

/close

@openshift-ci-robot (Contributor)

@abhinavdahiya: Closing this issue.

In response to this:

> So it looks like we have a consensus that the networking operators will be doing this.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
