
server: Deny serving Ignition to provisioned nodes #784

Open · cgwalters wants to merge 1 commit into master from cgwalters:mcs-check-machines
Conversation

cgwalters (Member) commented May 21, 2019

Ignition may contain secret data; pods running on the cluster
shouldn't have access.

This PR closes off access to any IP that responds on port 22, as that
is a port that is:

  • Known to be active by default
  • Not firewalled

A previous attempt at this used an auth token, but this fix doesn't require
changing the installer or people's PXE setups.

In the future we may reserve a port in the 9xxx range and have the
MCD respond on it, so that admins who disable or firewall SSH don't
end up with indirectly reduced security.

openshift-ci-robot commented May 21, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters


cgwalters (Member Author) commented May 21, 2019

To test this, I ran oc debug node/<worker> and then chroot /host iptables -I OPENSHIFT-BLOCK-OUTPUT -p tcp --dport 22623 -j ACCEPT to undo the SDN filtering. Then, from that same pod, I see:

# curl -k --head https://10.0.131.197:22623/config/master
HTTP/2 403 
content-length: 0
date: Tue, 21 May 2019 15:34:35 GMT

And in the MCS logs:

oc logs pods/machine-config-server-6snf2
I0521 15:27:25.572401       1 start.go:37] Version: 4.0.0-alpha.0-437-g307e6bea-dirty (307e6bea9ed1025202d431be21184fe9ea4f6066)
I0521 15:27:25.574374       1 api.go:52] launching server
I0521 15:27:25.574434       1 api.go:52] launching server
I0521 15:27:42.202818       1 api.go:106] Denying unauthorized request: Node ip-10-0-134-102.us-east-2.compute.internal with address 10.0.134.102 is already provisioned

ashcrow (Member) commented May 21, 2019

 func manifestsMachineconfigserverClusterroleYamlBytes() ([]byte, error) {
make: *** [verify] Error 1
hack/../pkg is out of date. Please run make update
@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 8f0585d to 3693948 May 21, 2019

cgwalters (Member Author) commented May 21, 2019

From a disaster recovery perspective, I think that if you have e.g. a master node with a static IP that you want to reprovision, then in order to allow access you'd need to edit the node object to drop its status/addresses data.

Or we could add a config flag to allow this.

In practice I think most people are going to start with trying to recover a master in-place rather than completely re-set it, and that path won't be affected by this.
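
As a hedged illustration only (not necessarily the exact procedure the comment above has in mind): assuming that removing the stale Node object entirely is acceptable for the recovery flow, doing so also drops its recorded status.addresses, so the MCS would serve Ignition to that IP again. The node name is a placeholder.

oc get node <stale-master> -o jsonpath='{.status.addresses}'   # confirm which address is being matched
oc delete node <stale-master>                                  # remove the stale Node object (and its recorded addresses)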

cgwalters (Member Author) commented May 21, 2019

OK, passing the main tests now; the upgrade job looks stuck, but I doubt that's related.

I also verified that scaling up the worker machineset still works, i.e. it doesn't deny the request.

cgwalters (Member Author) commented May 21, 2019

Maybe a better architecture would be for the MCS to make an HTTP request to the MCC asking whether a given IP is OK? That would increase the availability requirement for the MCC, though. Eh, this is probably fine: if there's some transient apiserver issue, a node's Ignition request will fail, but Ignition will keep retrying.

cgwalters (Member Author) commented May 21, 2019

/retest

abhinavdahiya (Member) commented May 22, 2019

In practice I think most people are going to start with trying to recover a master in-place rather than completely re-set it, and that path won't be affected by this.

Do we have concrete data from the DR team on this? It seems like Amazon taking a VM away might also be common....

How do we know that the interface that acts as the source IP is the one the node is reporting as its internal/public IP?

cgwalters (Member Author) commented May 22, 2019

How do we know that the interface that acts as the source IP is the one the node is reporting as its internal/public IP?

Bear in mind this PR is not claiming to be a complete solution to the problem. It's adding a layer of defense, much like the other layers we added. For example, it does nothing about external access. The auth key approach would be much stronger.

I'll check with the network team about this, but remember the primary thing we're trying to prevent with this is in-cluster pods accessing the endpoint. I have some confidence that that access will appear to come from the IP address associated with the node that the kubelet reports. But again, let's see what the SDN team says.

Seems like Amazon taking a VM away might also be common....

Right, instance retirement definitely happens. As I said, in that case, if a newly provisioned master happens to get the same IP, you'd need to explicitly drop the node object.

Or alternatively, we could tweak this PR to only disallow reachable nodes, which would be pretty easy.

danwinship commented May 22, 2019

the primary thing we're trying to prevent with this is in-cluster pods accessing the endpoint. I have some confidence that that access will appear to come from the IP address associated with the node that the kubelet reports.

Assuming the pod isn't using an egress IP or egress router, then traffic from a pod on node A addressed to node B's primary node IP will appear to come from node A's primary node IP.

But we didn't change the MCS to not listen on 0.0.0.0 did we? So a pod on node A could connect to node B's tun0 IP instead, and that would appear to come from the pod's IP directly. (So you'd want to filter out all connections with source IPs in the pod network as well. Or more simply, filter out connections if the destination IP is the tun0 IP.)
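
For illustration, a minimal Go sketch of the pod-network source-IP filter suggested above, assuming the server knows the cluster's pod CIDR; the names are illustrative, not the MCO's actual code:

package main

import (
	"fmt"
	"net"
)

// deniedAsPodTraffic reports whether a request's remote address falls inside
// the pod overlay network and should therefore be refused Ignition.
func deniedAsPodTraffic(remoteAddr, podCIDR string) (bool, error) {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // remoteAddr may already be a bare IP
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false, fmt.Errorf("invalid remote address %q", remoteAddr)
	}
	_, cidr, err := net.ParseCIDR(podCIDR)
	if err != nil {
		return false, err
	}
	return cidr.Contains(ip), nil
}

func main() {
	denied, _ := deniedAsPodTraffic("10.131.0.1:59360", "10.128.0.0/14")
	fmt.Println(denied) // true: the source is inside the pod network, so deny
}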

Also, in some environments, if nodes have multiple IPs, then a connection from a pod on node A to a non-primary IP on node B might appear to come from node A's non-primary IP rather than its primary IP.

if a newly provisioned master happens to get the same IP you'd need to explictly drop out the node object.

VMware at least definitely leaves stale node objects around when dynamically scaling nodes. (I don't know if that would ever affect masters, though.)

cgwalters (Member Author) commented May 22, 2019

(So you'd want to filter out all connections with source IPs in the pod network as well. Or more simply, filter out connections if the destination IP is the tun0 IP.)

Ah, OK. Hmm, though I'm not sure we can get that from the Go side of the HTTP request... but thinking about this, why don't we just add iptables rules on the masters requiring that the destination IP for the MCS is not tun0?

danwinship commented May 22, 2019

iptables doesn't see the packet because it's delivered by OVS

cgwalters (Member Author) commented May 22, 2019

/hold
Per discussion above, this needs more work to implement the suggestions in #784 (comment)

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 3693948 to 1cc5062 May 24, 2019
@openshift-ci-robot openshift-ci-robot added size/L and removed size/M labels May 24, 2019
@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 1cc5062 to d0fcc92 May 24, 2019

cgwalters (Member Author) commented May 24, 2019

/hold cancel

OK, updated 🆕 to also deny requests coming from tun0. I verified that both checks work, by making a request to the main master IP directly as well as by targeting its tun0 address.

oc logs pods/machine-config-server-l7wd2
I0523 22:27:09.952731       1 start.go:37] Version: 4.0.0-alpha.0-444-g9eebd1d4-dirty (9eebd1d4a17eb2d26ae74709252cf6ea77330703)
I0523 22:27:09.955394       1 api.go:52] launching server
I0523 22:27:09.955474       1 api.go:52] launching server
I0524 18:00:24.919941       1 api.go:99] Pool master requested by 10.0.137.206:32870
I0524 18:00:24.936254       1 api.go:106] Denying unauthorized request: Node ip-10-0-137-206.us-east-2.compute.internal with address 10.0.137.206 is already provisioned
I0524 18:06:39.321439       1 api.go:99] Pool master requested by 10.131.0.1:59360
I0524 18:06:39.330104       1 api.go:106] Denying unauthorized request: Requesting host 10.131.0.1 is within pod network CIDR 10.128.0.0/14

We also now only deny requests from nodes that are NodeReady=true.

And I verified with this patch that scaling up a machineset still works.
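
A hedged sketch of the "already provisioned" check described above, assuming the server has listed the cluster's nodes via the Kubernetes API (k8s.io/api types); the identifiers are illustrative rather than the actual MCS code:

package server

import (
	corev1 "k8s.io/api/core/v1"
)

// alreadyProvisioned reports whether the requesting IP belongs to a node the
// cluster already knows about. Nodes that are not Ready are skipped so they
// can be reprovisioned.
func alreadyProvisioned(nodes []corev1.Node, requestIP string) (bool, string) {
	for _, node := range nodes {
		if !nodeReady(node) {
			continue
		}
		for _, addr := range node.Status.Addresses {
			if addr.Address == requestIP {
				return true, node.Name
			}
		}
	}
	return false, ""
}

// nodeReady checks the NodeReady condition on the node's status.
func nodeReady(node corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}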

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from d0fcc92 to d408e61 May 24, 2019
Excerpt from the diff under review:

}

for _, node := range nodes.Items {
	// Ignore NodeReady=false nodes; this allows reprovisioning them.

abhinavdahiya (Member) commented on the diff, May 24, 2019

Node not ready doesn't mean the node is getting re-provisioned. Where did we get that assumption?

abhinavdahiya (Member) commented May 24, 2019

Also, in some environments, if nodes have multiple IPs, then a connection from a pod on node A to a non-primary IP on node B might appear to come from node A's non-primary IP rather than its primary IP.

This is still not solved.

cgwalters (Member Author) commented May 24, 2019

[multi-NIC] is still not solved.

Right, but...again, I'm not claiming a comprehensive solution here. The issue is important enough to warrant layers of defense.

That said...one approach we could take (derived from an approach @ericavonb mentioned) is to run a daemonset service on each node (probably as part of the MCD) that listens on a well-known port and provides a static reply. Then we could have the MCS "call back" to the requesting IP - if it gets a reply, it knows to deny the request.
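
A tiny, purely illustrative sketch of that call-back idea: a daemonset process (e.g. folded into the MCD) answers with a static reply on a well-known port, and the MCS denies Ignition to any requester that answers. The port and path below are made up.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Static reply meaning "this IP already belongs to a provisioned node".
	http.HandleFunc("/provisioned", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// The MCS would call back to http://<requesting-ip>:9123/provisioned and
	// deny the Ignition request if it gets any answer at all.
	if err := http.ListenAndServe(":9123", nil); err != nil {
		panic(err)
	}
}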

cgwalters (Member Author) commented May 24, 2019

Or...maybe way simpler, just try to connect to port 22.
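
A minimal sketch of that idea, assuming a short dial timeout; the helper name and timeout value are illustrative:

package server

import (
	"net"
	"time"
)

// respondsOnSSH reports whether anything is listening on port 22 at the given
// IP. A refused or timed-out dial is treated as "not yet provisioned".
func respondsOnSSH(ip string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, "22"), 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}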

crawford (Member) commented Jun 11, 2019

I think we do need to worry about that delay. The time it takes to provision a new node is a key metric that is used when comparing K8S offerings.

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 3d29541 to ffdc7ff Jun 13, 2019
@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from ffdc7ff to af9cb90 Jun 14, 2019

cgwalters (Member Author) commented Jun 14, 2019

I think we do need to worry about that delay. The time it takes to provision a new node is a key metric that is used when comparing K8S offerings.

It seems to take about 4 minutes to provision a new worker for me - offhand, that time looks like it's dominated by the early pivot. So this adds about 2% to the provisioning time, but...yeah, I agree it's the wrong direction.

I don't understand at all why we're getting connection timeouts instead of connection refused; it may relate to the network state in the initramfs? Going to dig into that.

michaelgugino (Contributor) commented Aug 6, 2019

@cgwalters what about checking whether the node's server port is listening? All nodes run that port, and we took away the ability to change it (AFAIK). This would definitively say 'this is a node'.

cgwalters (Member Author) commented Aug 7, 2019

what about checking whether the node's server port is listening? All nodes run that port, and we took away the ability to change it (AFAIK). This would definitively say 'this is a node'.

Sorry, which port is that? You're not talking about the kubelet 10250 one?

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from af9cb90 to fc2b74b Aug 7, 2019

michaelgugino (Contributor) commented Oct 17, 2019

what about checking whether the node's server port is listening? All nodes run that port, and we took away the ability to change it (AFAIK). This would definitively say 'this is a node'.

Sorry, which port is that? You're not talking about the kubelet 10250 one?

Yeah, I'm talking about whatever port the kubelet listens on for incoming connections (I know it as the 'server' port; I'm not sure what it's actually called, but it's used for getting pod logs and such).

Sorry it took me ten years to reply; too much spam from GitHub. @cgwalters

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch 3 times, most recently from 9b1c265 to 97cf082 Oct 18, 2019

cgwalters (Member Author) commented Oct 18, 2019

OK, I think I finally got this commit past make verify and make test-unit; my bad for not running those locally.
Now we have some registry flakes in the builds.
/retest

cgwalters (Member Author) commented Oct 18, 2019

OK yeah, still adds 5s:

2019-10-18T18:31:10.248255764Z I1018 18:31:10.248203       1 api.go:99] Pool worker requested by 10.0.132.105:35785
2019-10-18T18:31:15.342547142Z I1018 18:31:15.342482       1 cluster_server.go:100] Checking provisioned port 10250: dial tcp 10.0.132.105:10250: i/o timeout
2019-10-18T18:31:15.342547142Z I1018 18:31:15.342504       1 cluster_server.go:100] Checking provisioned port 22: dial tcp 10.0.132.105:22: i/o timeout
2019-10-18T18:31:15.342547142Z I1018 18:31:15.342528       1 cluster_server.go:91] Checked address 10.0.132.105 for provisioning in 5.000313618s
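
For context, a hedged guess at why both checks finish at the same instant and the total lands at ~5.0s: if the two ports are dialed in parallel under a shared timeout, both dials hitting an i/o timeout would take exactly that timeout. A sketch under that assumption (port list, names, and timeout are illustrative, not the actual MCS code):

package server

import (
	"net"
	"time"
)

// probeProvisionedPorts dials each port concurrently; if anything answers,
// the host is treated as already provisioned. When every dial times out (as
// in the AWS log above), the whole check takes roughly the shared timeout.
func probeProvisionedPorts(ip string, ports []string, timeout time.Duration) bool {
	results := make(chan bool, len(ports))
	for _, port := range ports {
		go func(port string) {
			conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, port), timeout)
			if err != nil {
				results <- false // connection refused or i/o timeout
				return
			}
			conn.Close()
			results <- true
		}(port)
	}
	for range ports {
		if <-results {
			return true
		}
	}
	return false
}

// e.g. probeProvisionedPorts("10.0.132.105", []string{"22", "10250"}, 5*time.Second)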

cgwalters (Member Author) commented Oct 18, 2019

I am not able to reproduce the timeout locally in libvirt, at least - I immediately get ECONNREFUSED while the system is sitting in the initramfs.

OH! But here's something interesting...this delay is AWS-specific! We can tell that now that we're running this operator's test suite on GCP (i.e. it's now e2e-gcp-op):

See this log:

2019-10-18T18:25:40.924407037Z I1018 18:25:40.924056       1 start.go:38] Version: machine-config-daemon-4.2.0-201907161330-345-gd0a52f85-dirty (d0a52f8506e63a74612dd4713b0dd24d8cbada5a)
2019-10-18T18:25:40.925178352Z I1018 18:25:40.925148       1 api.go:52] Launching server on :22624
2019-10-18T18:25:40.925250916Z I1018 18:25:40.925209       1 api.go:52] Launching server on :22623
2019-10-18T18:28:56.405696136Z I1018 18:28:56.405619       1 api.go:99] Pool worker requested by 10.0.32.3:37624
2019-10-18T18:28:56.517856181Z I1018 18:28:56.517235       1 cluster_server.go:100] Checking provisioned port 22: dial tcp 10.0.32.3:22: connect: connection refused
2019-10-18T18:28:56.517856181Z I1018 18:28:56.517309       1 cluster_server.go:100] Checking provisioned port 10250: dial tcp 10.0.32.3:10250: connect: connection refused
2019-10-18T18:28:56.517856181Z I1018 18:28:56.517321       1 cluster_server.go:91] Checked address 10.0.32.3 for provisioning in 1.859659ms
2019-10-18T18:29:00.587612577Z I1018 18:29:00.587561       1 api.go:99] Pool worker requested by 10.0.32.2:51946
2019-10-18T18:29:00.597302341Z I1018 18:29:00.595481       1 cluster_server.go:100] Checking provisioned port 22: dial tcp 10.0.32.2:22: connect: connection refused
2019-10-18T18:29:00.597302341Z I1018 18:29:00.595507       1 cluster_server.go:100] Checking provisioned port 10250: dial tcp 10.0.32.2:10250: connect: connection refused
2019-10-18T18:29:00.597302341Z I1018 18:29:00.595547       1 cluster_server.go:91] Checked address 10.0.32.2 for provisioning in 1.518867ms
2019-10-18T18:29:03.833399559Z I1018 18:29:03.833333       1 api.go:99] Pool worker requested by 10.0.32.4:44096
2019-10-18T18:29:03.841810966Z I1018 18:29:03.841737       1 cluster_server.go:100] Checking provisioned port 22: dial tcp 10.0.32.4:22: connect: connection refused
2019-10-18T18:29:03.841866064Z I1018 18:29:03.841822       1 cluster_server.go:100] Checking provisioned port 10250: dial tcp 10.0.32.4:10250: connect: connection refused
2019-10-18T18:29:03.841866064Z I1018 18:29:03.841843       1 cluster_server.go:91] Checked address 10.0.32.4 for provisioning in 1.706479ms

Which looks great, exactly what we want.

Is there some sort of ingress firewalling on AWS that's lifted after the machines have booted or something?

cgwalters (Member Author) commented Oct 18, 2019

So given that the delay is seemingly AWS-specific, I'd like to propose merging this - the security benefits seem well worth it. We can let it cook in master for a while and see if there's any unexpected fallout elsewhere; we can then revert if anything breaks (or perhaps make this code a runtime option via a configmap?). And we can investigate/debug the AWS delay in parallel.

@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 97cf082 to 701a736 Oct 21, 2019

cgwalters (Member Author) commented Nov 25, 2019

Now that 4.4 is open, any objections to landing this?

ashcrow (Member) commented Nov 25, 2019

No objection from me!

abhinavdahiya (Member) commented Nov 25, 2019

I just have a few things.

RHEL 7 nodes are joined to the cluster using the Ignition server, and those machines are provisioned. We need to make sure they keep working.

I would like to keep allowing requests from 127.0.0.1 or something, to make sure we have a way to debug the content being served.

Also, since this might affect users, an enhancement that provides the details of the new behavior would be very nice.

cgwalters (Member Author) commented Nov 25, 2019

RHEL 7 nodes are joined to the cluster using the Ignition server, and those machines are provisioned. We need to make sure they keep working.

See openshift/openshift-ansible#11614 - for the same reasons motivating this PR, openshift-ansible doesn't query the MCS directly anymore.

I would like to keep allowing requests from 127.0.0.1 or something, to make sure we have a way to debug the content being served.

OK, done.
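
A minimal sketch of that loopback carve-out, assuming it runs before the deny checks; the helper name is illustrative:

package server

import "net"

// isLoopbackRequest allows local debugging of the served Ignition content,
// e.g. curl against 127.0.0.1 from the machine running the MCS.
func isLoopbackRequest(remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}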

Also, since this might affect users, an enhancement that provides the details of the new behavior would be very nice.

Understood, though a few things:

First, this is building on top of a lot of patches to other components that came before the enhancements process - most of such an enhancement would probably be writing up all of the history behind the current state, right?

Second and more importantly: it's possible this could break a user, but the reason I'm proposing this approach is I think it's unlikely. In general, anyone who has a valid reason to get data from the MCS should also have the credentials to pull it from the cluster directly (as openshift-ansible does now).

Third, part of the idea here is to land this in master and let it soak for a bit to see if e.g. something unusual in bare metal CI shakes out - if there's fallout we can revert, whereas an enhancement feels like more of a commitment.

Finally, I'd like to spend any enhancement-writing time related to this on strengthening the "machine identity" model - ideally supporting TPMs, including linkage between the MCO and machineAPI, probably having the MCO be able to match up machine objects when serving Ignition, etc.

The updated commit message:

Ignition may contain secret data; pods running on the cluster
shouldn't have access.

First, we deny any request that appears to come from the pod overlay
network.  This closes off a lot of avenues without any risk.

However, according to the SDN team, we can't guarantee that all in-cluster
requests appear to originate from the pod network; in particular, machines
that have multiple NICs may not.

Hence, this PR also closes off access to any IP that responds on
port 22, as that is a port that is:

 - Known to be active by default
 - Not firewalled

A previous attempt at this used an [auth token](#736), but this fix doesn't
require changing the installer or people's PXE setups.

In the future we may reserve a port in the 9xxx range and have the
MCD respond on it, so that admins who disable or firewall SSH don't
end up with indirectly reduced security.
@cgwalters cgwalters force-pushed the cgwalters:mcs-check-machines branch from 701a736 to 24cecd5 Nov 25, 2019

crawford (Member) commented Nov 26, 2019

Second and more importantly: it's possible this could break a user, but the reason I'm proposing this approach is I think it's unlikely.

This has the potential to significantly complicate disaster recovery. How much thought has been put into how this will affect those procedures?

Finally, I'd like to spend any enhancement-writing time related to this on strengthening the "machine identity" model - ideally supporting TPMs, including linkage between the MCO and machineAPI, probably having the MCO be able to match up machine objects when serving Ignition, etc.

This has already been solved at the Kubernetes layer, which is where I still believe we should tackle this problem. I'm still unconvinced that Ignition should learn how to help attest the identity of the underlying machine.

cgwalters (Member Author) commented Nov 26, 2019

This has the potential to significantly complicate disaster recovery. How much thought has been put into how this will affect those procedures?

How so? If you want to re-run Ignition manually or something?

This has already been solved at the Kubernetes layer, which is where I still believe we should tackle this problem. I'm still unconvinced that Ignition should learn how to help attest the identity of the underlying machine.

I don't think we've solved machine identity well at the Kubernetes layer - why do you think it's solved? Personally, I think https://github.com/openshift/cluster-machine-approver is a hack...
There's some prior work on TPMs for Kube; see https://github.com/coreos/kubernetes-dtc-admission at least.

(This isn't trying to get machine identity into Ignition really, but Ignition would be part of enabling it at a higher level)

kikisdeliveryservice (Member) commented Nov 27, 2019

/skip

openshift-ci-robot commented Nov 27, 2019

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Rerun command
ci/prow/e2e-etcd-quorum-loss   3d29541  /test e2e-etcd-quorum-loss
ci/prow/e2e-aws-disruptive     fc2b74b  /test e2e-aws-disruptive
ci/prow/verify                 24cecd5  /test verify
ci/prow/e2e-aws-scaleup-rhel7  24cecd5  /test e2e-aws-scaleup-rhel7
ci/prow/e2e-gcp-op             24cecd5  /test e2e-gcp-op
ci/prow/e2e-vsphere            24cecd5  /test e2e-vsphere


michaelgugino (Contributor) commented Dec 4, 2019

The commit message should be updated to reflect the kubelet port check.
