
packet: disable syncing allowed SSH keys on nodes #471

Merged
invidian merged 2 commits into master from invidian/packet-disable-ssh-keys-integration on May 29, 2020

Conversation

invidian
Member

This commit disables syncing authorized SSH keys for the core user on
nodes from the Packet user's and project's keys, so that only the keys
specified in the Lokomotive configuration are actually authorized.
Having more keys authorized than the configuration specifies is a
potential security threat.

Closes #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>

@invidian
Member Author

The big question here is: do we want to test this now, and if yes, how?

@johananl
Member

The big question here is: do we want to test this now, and if yes, how?

By "test", do you mean "test automatically"? AFAICT all we have to do to test this change manually is deploy a cluster with one key specified in Ignition, unless I've completely missed your point.

@invidian
Member Author

By "test", do you mean "test automatically"? AFAICT all we have to do to test this change manually is deploy a cluster with one key specified in Ignition, unless I've completely missed your point.

Yes, I meant "test automatically"; testing manually does not count :) Anyway, I have an idea for how to test this. CI takes the SSH key from the SSH_KEY environment variable, so I'll grab that, create the DaemonSet defined below from the test, and then wait until it converges. I'd use a Job, but it is not possible to run a Job on all nodes, and this is something we should run on all of them:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test-ssh-keys
spec:
  selector:
    matchLabels:
      name: test-ssh-keys
  template:
    metadata:
      labels:
        name: test-ssh-keys
    spec:
      # Tolerate the master taint so the check also runs on controller nodes.
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: test-ssh-keys
        image: ubuntu
        command: ["bash"]
        # Verify the checksum of the node's authorized_keys file; on a match,
        # keep the pod Running (tail -f) so the DaemonSet converges.
        args: ["-c", "md5sum --status -c <(echo e45cf23b986ef85576c1721602d6c0d3 /home/core/.ssh/authorized_keys) && tail -f /dev/null"]
        volumeMounts:
        - name: ssh
          mountPath: /home/core/.ssh
          readOnly: true
      volumes:
      # The core user's .ssh directory, mounted from the host read-only.
      - name: ssh
        hostPath:
          path: /home/core/.ssh
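
For reference, a minimal Go sketch of how an e2e test could wait for this DaemonSet to converge after creating it, assuming a client-go clientset; the helper name and polling values here are illustrative, not code from this PR:

package system_test

import (
	"context"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady polls the DaemonSet status until every desired pod
// is ready, or fails the test on timeout.
func waitForDaemonSetReady(t *testing.T, client kubernetes.Interface, ns, name string) {
	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Converged: every node that should run a pod has a ready pod.
		return ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
	if err != nil {
		t.Fatalf("waiting for daemonset %q to converge: %v", name, err)
	}
}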

@johananl
Member

johananl commented May 22, 2020

Where will this logic sit? Inside a Go test?

@invidian
Member Author

Where will this logic sit? Inside a Go test?

Yes, in our e2e test suite.

@invidian
Member Author

Implemented tests for that, it was fun!

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 22d1aa3 to efd1e56 on May 22, 2020 16:54
Contributor

@iaguis iaguis left a comment

A couple of nits.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from efd1e56 to 9a73a36 on May 25, 2020 14:41
@invidian invidian requested a review from iaguis May 25, 2020 14:42
@iaguis
Contributor

iaguis commented May 25, 2020

This doesn't seem to be tested:

=== RUN   TestNoExtraSSHKeysOnNodes
=== PAUSE TestNoExtraSSHKeysOnNodes
=== CONT  TestNoExtraSSHKeysOnNodes
    TestNoExtraSSHKeysOnNodes: ssh_keys_test.go:93: "SSH_KEY" environment variable not set
--- SKIP: TestNoExtraSSHKeysOnNodes (0.00s)

I think the right env var is $PUB_KEY.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 9a73a36 to 48be61d on May 25, 2020 17:17
@invidian
Member Author

I think the right env var is $PUB_KEY.

Oops, fixed, thanks!

Member

@johananl johananl left a comment

Added some comments (the main one is missing articles in comment strings).

I can't get the test to pass - the DS pods are crashlooping and there is no output in the logs. I've created a Packet cluster and used the following commands to run the test:

export PUB_KEY=$(cat key.pub)
go test -v -tags e2e,packet ./test/system -run TestNoExtraSSHKeysOnNodes

Output:

=== RUN   TestNoExtraSSHKeysOnNodes
=== PAUSE TestNoExtraSSHKeysOnNodes
=== CONT  TestNoExtraSSHKeysOnNodes
    TestNoExtraSSHKeysOnNodes: util.go:65: using KUBECONFIG=/home/johannes/kinvolk/clusters/packet-ssh-keys/assets/cluster-assets/auth/kubeconfig
    TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
    TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
    ...
    TestNoExtraSSHKeysOnNodes: util.go:134: error while waiting for the daemonset: timed out waiting for the condition
--- FAIL: TestNoExtraSSHKeysOnNodes (300.25s)
FAIL
FAIL	github.com/kinvolk/lokomotive/test/system	300.257s
FAIL

test/components/util/util.go
const (
	// RetryInterval is the default retry interval for tests using a retry strategy,
	// so they don't have to specify their own local intervals.
	RetryInterval = time.Second * 5
)
Member

Why put these consts here? They are used in test/system/ssh_keys_test.go so I think they should live as close as possible to where they are used. I think we want to minimize our reliance on "util" packages, not extend it.

I'm assuming your rationale is to avoid duplication when new code is added in the future which may need to use retries. My suggestion is to do the refactoring if/when that happens, because then you will have the necessary info regarding the best place to put these consts.

Member Author

This prepares the ground for better code by adding the consts and then actually using them. Refactoring is out of scope for this PR, and I want to avoid adding yet more duplication.

This way, newly added code should use those consts, meaning that once we actually do the refactoring, we have less work to do.

Member

@johananl johananl May 27, 2020

I'm not against adding the constants; I'm against adding them in a separate file because we might need to reuse them at some unknown point in the future.

Would you consider doing the change when it is actually needed rather than based on a theoretical prediction of the future? The price we are paying by doing this right now is that we expand our reliance on "util" packages. If we put every piece of code that we may be able to reuse in the future in a "util" package, we'd gradually be moving most of our logic there.

If you do the optimization once you encounter a concrete problem (e.g. "oh, I need to duplicate this code to make it work"), you will have a better idea where to put this code based on the concrete case you have.

Member Author

I would have to duplicate this code already:

$ git grep 'time.Minute\*5' test/
test/components/cert-manager/cert-manager_test.go:                      testutil.WaitForDeployment(t, client, namespace, test.deployment, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go:        testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go:        testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/coredns/coredns_test.go:        testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/dex/dex_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/external-dns/external-dns_test.go:              testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go:     testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go:     testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/gangway/gangway_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go:                testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metrics-server/metrics_server_test.go:          testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/openebs-operator/openebs_operator_test.go:                      testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/rook/rook_test.go:              testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)

This change will prevent exactly that.

Member

@johananl johananl May 28, 2020

I understand you are bothered by "duplicating" the expression time.Minute*5 inside test/components. I don't understand how the test/components package is related to this PR (which touches SSH keys on Packet). I also don't think that repeating time.Minute*5 can be considered code duplication, unless we are assuming that all these tests must use the same value there. If the "duplication" in test/components is what we're discussing here, I'd say it's completely out of scope and would move that part into a separate PR.

FWIW, if we decide that all tests should use the same time.Minute*5 value, the way I would solve it is by defining the constant in test/components and avoiding the util package. To be precise, I'd put the following code in test/components/components.go (or even util.go if you insist):

package components

import "time"

const (
    RetryInterval = time.Second * 5
)

For this PR I would not use a constant that's intended for use by components, because the code in this PR isn't related to components. In case you are arguing that we should use one global value for all tests, I would simply define the constant in the test package and import it into any "child" package under test/.

Lastly, introducing some duplication is IMO much cheaper than adding an import between two otherwise-unrelated packages.

Member

In case I wasn't able to convince you with the above, please go ahead with whatever works for you. We need this change to get merged and I don't want to block it any further.

Contributor

FWIW I agree with @johananl here.

Member Author

I removed the constants.

// +build aws aws_edge baremetal packet
// +build e2e

package system_test
Member

I suggest we call this package system. It's already under test/ so test/system_test repeats "test" unnecessarily (see the "do not stutter" recommendation). Also, underscores or camel case are usually discouraged in the Go community. If you feel it's absolutely necessary to include "test" in the package name, I would consider systest or systemtest to make the package name more idiomatic. Relevant info: https://blog.golang.org/package-names

Member Author

This is to satisfy a new linter, which IMO we should also incorporate for other code: https://github.com/maratori/testpackage.

Member

Have we agreed on using this new linter? I'm not familiar with this tool. Anyway, if that's what this linter is suggesting, I'd say we need to reconsider using it :-/

Member Author

Generally speaking, we should be following best practices and this is one of them. The linter enforces them. If we don't agree to follow some, we make a team decision and we disable some linters. This allows writing more consistent code for everyone involved in the project.

In this particular case, using this package name for unit tests enforces testing only exported functions, which makes refactoring easier in the future. It does not make much sense for e2e tests, but I don't see a reason why we should make a distinction here.

Member

Generally speaking, we should be following best practices and this is one of them. The linter enforces them.

Right. The relevant Go best practice is here: https://blog.golang.org/package-names
Where is the best practice which says system_test is preferred? :-/

If we don't agree to follow some, we make a team decision and we disable some linters. This allows writing more consistent code for everyone involved in the project.

Following that logic, shouldn't we have a team decision to add a linter, too? In any case, IMO the official Go recommendations are much stronger than the opinion of a community member who wrote a linter. Do you think otherwise?

Contributor

I'd go without _test.

I see the point of the _test convention: it forces only exported functions to be tested (it's briefly mentioned here https://tip.golang.org/cmd/go/#hdr-Test_packages and it's used in the Go standard library).

In this case it doesn't really matter because we're in a separate package anyway and we're not doing unit tests.
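
To illustrate what the _test suffix does (a generic sketch, not code from this repository): for a file in package foo, a sibling foo_test.go can declare package foo_test, which compiles as a separate package and can only reach foo's exported identifiers:

package foo_test

import (
	"testing"

	"example.com/foo" // hypothetical module path
)

func TestExported(t *testing.T) {
	// foo.Exported() is reachable here; an unexported foo.helper() would
	// not compile from this external test package.
	if got := foo.Exported(); got != 42 {
		t.Fatalf("foo.Exported() = %d, want 42", got)
	}
}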

Member

Thanks @iaguis. I wasn't aware _test in package names was a thing, and couldn't figure this out from the discussion above. So IIUC _test is a special case, i.e. there is no recommendation to call a package cluster_configuration. Right?

Contributor

Yes, IIUC this is only for tests, not for package names in general.

Member Author

If I should go without _test (though I think it does not matter here and it's a good habit to keep it), what should I do with the linter? Do we want to disable the rule globally? Or should each new test file get //nolint:testpackage?

Member Author

Just added //nolint:testpackage for now.

@johananl
Member

Looks like the cleanup doesn't work - I still have the DS on the cluster following a failed test run. Could this be caused by the following?

t.Errorf("error while waiting for the daemonset: %v", err)

@johananl
Member

I've checked /home/core/.ssh/authorized_keys on the nodes. It contains the key specified by Ignition but also an empty line. This could be a hint.

Also, we can probably improve the test: right now we are determining pass/fail based on the convergence of the DS. This has the following problems:

  • In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.
  • We don't get meaningful output regarding the reason for the failure.

I suggest the following changes:

  • Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.
  • Perform the test by spawning the pod using a dummy command (e.g. sleep), then running cat /home/core/.ssh/authorized_keys interactively using the k8s client while collecting its output. Alternatively we could run a k8s job with the command in the manifest assuming we can gather the output later. This would allow us to run test logic in Go code and provide a meaningful error in case of failure; see the sketch below.
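
For what it's worth, a rough client-go sketch of the exec-based approach (a sketch only; the function and names are illustrative, not code from this PR):

package system_test

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// podExec runs a command in the given pod and returns its stdout, so the
// test can inspect /home/core/.ssh/authorized_keys and fail with a
// meaningful message.
func podExec(client kubernetes.Interface, config *restclient.Config, ns, pod, container string, command []string) (string, error) {
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", fmt.Errorf("creating executor: %w", err)
	}

	var stdout, stderr bytes.Buffer
	if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec: %v (stderr: %q)", err, stderr.String())
	}

	return stdout.String(), nil
}

Usage would be along the lines of podExec(client, config, "default", "test-pod", "test-ssh-keys", []string{"cat", "/home/core/.ssh/authorized_keys"}).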

@johananl
Member

Ah, the problem is that the "# auto-generated by update-ssh-keys" line doesn't exist in the authorized_keys file on disk. Is it supposed to be there?

@invidian
Member Author

Ah, the problem is that the "# auto-generated by update-ssh-keys" line doesn't exist in the authorized_keys file on disk. Is it supposed to be there?

Ah, with the unit disabled, we no longer get this line. I was modifying the file manually to make the test pass and didn't have time to check the CI results. I'll remove that line then.

Looks like the cleanup doesn't work - I still have the DS on the cluster following a failed test run. Could this be caused by the following?

IIRC the cleanup only takes place if the test succeeds, so you can investigate failures. Do you think we should always do the cleanup?

Regarding the failing tests, I think we must run on at least one worker and one controller node, as there are 2 knobs in the code responsible for this (#449). The easiest way to do that is via a DaemonSet. Unfortunately, a DaemonSet does not support restartPolicy: Never, which would allow determining the state quickly.

Regarding the reason for a failing test: given that we run tests on CI, I wouldn't print the content of authorized_keys in the test logs, as it may leak e.g. email addresses. I think just reporting a SHA512 mismatch and skipping the cleanup is reasonable here, so you can go and investigate.
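
For illustration, computing the expected digest on the Go side could look like this (a sketch; the function name is made up, and the argument stands for the expected authorized_keys content):

package system_test

import (
	"crypto/sha512"
	"fmt"
)

// expectedSum returns the hex SHA512 digest of the given authorized_keys
// content, in the format sha512sum prints, so test logs only ever contain
// the digest, never the key material itself.
func expectedSum(authorizedKeys string) string {
	return fmt.Sprintf("%x", sha512.Sum512([]byte(authorizedKeys)))
}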

Perform the test by spawning the pod using a dummy command (e.g. sleep), then running cat /home/core/.ssh/authorized_keys interactively using the k8s client while collecting its output. Alternatively we could run a k8s job with the command in the manifest assuming we can gather the output later. This would allow us to run test logic in Go code and provide a meaningful error in case of failure.

I wanted to avoid doing more interactive things, as they seem overly complicated to use in testing.

Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.

This would make sense, but then I need to iterate over all the nodes, create pods there etc.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch 2 times, most recently from 8048022 to 992b55b on May 27, 2020 13:02
@invidian
Member Author

In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.

Also, we should be optimizing for the passing case, not the failing one, as ideally the test should never fail. If it does fail, IMO it's fine for that to happen slowly.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 992b55b to ceb2e46 on May 27, 2020 13:39
@invidian
Member Author

Tests are passing, except for the flaky ARM job.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch 4 times, most recently from e459649 to e8df46e on May 28, 2020 09:20
@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from e8df46e to 859f9eb on May 28, 2020 10:58
@invidian invidian requested a review from johananl May 28, 2020 11:10
@surajssd
Member

Now all of them fail on bootkube 🤔

@johananl
Member

In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.

Also, we should be optimizing for the passing case, not the failing one, as ideally the test should never fail. If it does fail, IMO it's fine for that to happen slowly.

I disagree. A slow test affects the run time of all CI builds. If we have an easy way to fail fast, I don't see why we should wait 5 minutes.

@johananl
Member

IIRC the cleanup only takes place if the test succeeds, so you can investigate failures. Do you think we should always do the cleanup?

I guess we can skip the cleanup on failure as long as doing so doesn't affect other tests.

@johananl
Member

Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.

This would make sense, but then I need to iterate over all the nodes, create pods there etc.

You won't have to iterate. AFAICT you only have to test one worker node and one controller node.

@invidian
Member Author

You won't have to iterate. AFAICT you only have to test one worker node and one controller node.

This means I need to find the node names and then run 2 independent, though very similar, tests. This seems fragile and unnecessarily complicated to me.

I disagree. A slow test affects the run time of all CI builds. If we have an easy way to fail fast, I don't see why we should wait 5 minutes.

But unless you break things, you won't notice it, so it will only affect a small percentage of the builds.

I guess we can skip the cleanup on failure as long as doing so doesn't affect other tests.

I don't think it should affect anything.

@johananl
Member

@invidian please go ahead with whatever makes sense to you regarding the test. My comments regarding the DaemonSet weren't meant as a "hard block". They were rather suggestions on how to make the test better in case you're still working on improving it.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 859f9eb to 6b6f3d8 on May 28, 2020 15:43
iaguis previously approved these changes May 29, 2020
Contributor

@iaguis iaguis left a comment

lgtm when green.

This commit disables syncing authorized SSH keys for the core user on
nodes from the Packet user's and project's keys, so that only the keys
specified in the Lokomotive configuration are actually authorized.
Having more keys authorized than the configuration specifies is a
potential security threat.

Closes #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit adds an e2e test to verify that no extra SSH keys are added
to the Kubernetes nodes other than the one specified in the SSH_KEY
environment variable, which is used by the CI for provisioning the
clusters.

Ideally, the test would pull the keys directly from the cluster
configuration; however, this is currently not trivial to implement.

Refs #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 6b6f3d8 to 72079dd on May 29, 2020 12:18
@invidian
Member Author

@iaguis tests are passing.

Contributor

@iaguis iaguis left a comment

lgtm

Member

@ipochi ipochi left a comment

lgtm, with a suggestion.

// Set the right arguments from the manifest with the desired SHA512 sum.
ds.Spec.Template.Spec.Containers[0].Args = []string{
	"-c",
	fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll
}
Member

Suggested change
fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll
fmt.Sprintf("sha512sum "+
	"--status -c <(echo %s /home/core/.ssh/authorized_keys) && "+
	"exec tail -f /dev/null", sum),

Would something like this be better, to avoid //nolint:lll?

Member Author

I'm not sure if that improves the readability 😄

@ipochi
Member

ipochi commented May 29, 2020

@invidian please re-trigger the job, the baremetal job failed for some weird reason.

@invidian
Member Author

@invidian please re-trigger the job, the baremetal job failed for some weird reason.

It's done, it runs here: https://yard.lokomotive-k8s.net/teams/main/pipelines/lokomotive-oss-bare-metal-pr/jobs/test-lokomotive-deployment/builds/611.1

@invidian invidian merged commit c5728c2 into master May 29, 2020
@invidian invidian deleted the invidian/packet-disable-ssh-keys-integration branch May 29, 2020 14:20

Successfully merging this pull request may close these issues.

Packet: handle automatic authorization of all API-level SSH keys