
packet: disable syncing allowed SSH keys on nodes #471

Merged
invidian merged 2 commits into master from invidian/packet-disable-ssh-keys-integration on May 29, 2020

Conversation

invidian
Member

This commit disables syncing authorized SSH keys for the core user on
nodes from the Packet user's and project's keys, so that only the keys
specified in the Lokomotive configuration are actually authorized.
Having more keys authorized than the configuration specifies is a
potential security threat.

Closes #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>

@invidian
Member Author

The big question here is: do we want to test this now, and if yes, how?

@johananl
Member

The big question here is: do we want to test this now, and if yes, how?

By "test", do you mean "test automatically"? AFAICT all we have to do to test this change manually is deploy a cluster with one key specified in Ignition, unless I've completely missed your point.

@invidian
Member Author

By "test", do you mean "test automatically"? AFAICT all we have to do to test this change manually is deploy a cluster with one key specified in Ignition, unless I've completely missed your point.

Yes, I meant "test automatically"; testing manually does not count :) Anyway, I have an idea for how to test this. CI takes the SSH key from the SSH_KEY environment variable, so I'll grab that, create the DaemonSet defined below from the test, and then wait until it converges. I'd use a Job, but it is not possible to run a Job on all nodes, and this is something we should run on all of them:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test-ssh-keys
spec:
  selector:
    matchLabels:
      name: test-ssh-keys
  template:
    metadata:
      labels:
        name: test-ssh-keys
    spec:
      # Tolerate the master taint so the check also runs on controller nodes.
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: test-ssh-keys
        image: ubuntu
        command: ["bash"]
        # Verify the checksum of the node's authorized_keys file; on a match,
        # keep the pod Running (tail -f) so the DaemonSet converges.
        args: ["-c", "md5sum --status -c <(echo e45cf23b986ef85576c1721602d6c0d3 /home/core/.ssh/authorized_keys) && tail -f /dev/null"]
        volumeMounts:
        - name: ssh
          mountPath: /home/core/.ssh
          readOnly: true
      volumes:
      # The core user's .ssh directory, mounted from the host read-only.
      - name: ssh
        hostPath:
          path: /home/core/.ssh
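
For reference, a minimal Go sketch of how an e2e test could wait for this DaemonSet to converge after creating it, assuming a client-go clientset; the helper name and polling values here are illustrative, not code from this PR:

package system_test

import (
	"context"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady polls the DaemonSet status until every desired pod
// is ready, or fails the test on timeout.
func waitForDaemonSetReady(t *testing.T, client kubernetes.Interface, ns, name string) {
	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Converged: every node that should run a pod has a ready pod.
		return ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
	if err != nil {
		t.Fatalf("waiting for daemonset %q to converge: %v", name, err)
	}
}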

@johananl
Member

johananl commented May 22, 2020

Where will this logic sit? Inside a Go test?

@invidian
Member Author

Where will this logic sit? Inside a Go test?

Yes, in our e2e test suite.

@invidian
Member Author

Implemented tests for that, it was fun!

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 22d1aa3 to efd1e56 on May 22, 2020 16:54
Contributor

@iaguis iaguis left a comment

A couple of nits.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from efd1e56 to 9a73a36 on May 25, 2020 14:41
@invidian invidian requested a review from iaguis May 25, 2020 14:42
@iaguis
Contributor

iaguis commented May 25, 2020

This doesn't seem to be tested:

=== RUN   TestNoExtraSSHKeysOnNodes
=== PAUSE TestNoExtraSSHKeysOnNodes
=== CONT  TestNoExtraSSHKeysOnNodes
    TestNoExtraSSHKeysOnNodes: ssh_keys_test.go:93: "SSH_KEY" environment variable not set
--- SKIP: TestNoExtraSSHKeysOnNodes (0.00s)

I think the right env var is $PUB_KEY.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 9a73a36 to 48be61d on May 25, 2020 17:17
@invidian
Member Author

I think the right env var is $PUB_KEY.

Oops, fixed, thanks!

Member

@johananl johananl left a comment

Added some comments (the main one is missing articles in comment strings).

I can't get the test to pass - the DS pods are crashlooping and there is no output in the logs. I've created a Packet cluster and used the following commands to run the test:

export PUB_KEY=$(cat key.pub)
go test -v -tags e2e,packet ./test/system -run TestNoExtraSSHKeysOnNodes

Output:

=== RUN   TestNoExtraSSHKeysOnNodes
=== PAUSE TestNoExtraSSHKeysOnNodes
=== CONT  TestNoExtraSSHKeysOnNodes
    TestNoExtraSSHKeysOnNodes: util.go:65: using KUBECONFIG=/home/johannes/kinvolk/clusters/packet-ssh-keys/assets/cluster-assets/auth/kubeconfig
    TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
    TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
    ...
    TestNoExtraSSHKeysOnNodes: util.go:134: error while waiting for the daemonset: timed out waiting for the condition
--- FAIL: TestNoExtraSSHKeysOnNodes (300.25s)
FAIL
FAIL	github.com/kinvolk/lokomotive/test/system	300.257s
FAIL

test/components/util/util.go
const (
	// RetryInterval is the default retry interval for tests using a retry strategy,
	// so they don't have to specify their own local intervals.
	RetryInterval = time.Second * 5
)
Member

Why put these consts here? They are used in test/system/ssh_keys_test.go so I think they should live as close as possible to where they are used. I think we want to minimize our reliance on "util" packages, not extend it.

I'm assuming your rationale is to avoid duplication when new code is added in the future which may need to use retries. My suggestion is to do the refactoring if/when that happens, because then you will have the necessary info regarding the best place to put these consts.

Member Author

This prepares the ground for better code by adding the consts and then actually using them. Refactoring is out of scope for this PR, and I want to avoid adding yet more duplication.

This way, newly added code should use those consts, meaning that once we actually do the refactoring, we have less work to do.

Member

@johananl johananl May 27, 2020

I'm not against adding the constants; I'm against adding them in a separate file because we might need to reuse them at some unknown point in the future.

Would you consider doing the change when it is actually needed rather than based on a theoretical prediction of the future? The price we are paying by doing this right now is that we expand our reliance on "util" packages. If we put every piece of code that we may be able to reuse in the future in a "util" package, we'd gradually be moving most of our logic there.

If you do the optimization once you encounter a concrete problem (e.g. "oh, I need to duplicate this code to make it work"), you will have a better idea where to put this code based on the concrete case you have.

Member Author

I would have to duplicate this code already:

$ git grep 'time.Minute\*5' test/
test/components/cert-manager/cert-manager_test.go:                      testutil.WaitForDeployment(t, client, namespace, test.deployment, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go:        testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go:        testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/coredns/coredns_test.go:        testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/dex/dex_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/external-dns/external-dns_test.go:              testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go:     testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go:     testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/gangway/gangway_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go:                testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go:                testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metrics-server/metrics_server_test.go:          testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/openebs-operator/openebs_operator_test.go:                      testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/rook/rook_test.go:              testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)

This change will prevent exactly that.

Member

@johananl johananl May 28, 2020

I understand you are bothered by "duplicating" the expression time.Minute*5 inside test/components. I don't understand how the test/components package is related to this PR (which touches SSH keys on Packet). I also don't think that repeating time.Minute*5 can be considered code duplication, unless we are assuming that all these tests must use the same value there. If the "duplication" in test/components is what we're discussing here, I'd say it's completely out of scope and would move that part into a separate PR.

FWIW, if we decide that all tests should use the same time.Minute*5 value, the way I would solve it is by defining the constant in test/components and avoiding the util package. To be precise, I'd put the following code in test/components/components.go (or even util.go if you insist):

package components

import "time"

const (
    RetryInterval = time.Second * 5
)

For this PR I would not use a constant that's intended for use by components, because the code in this PR isn't related to components. In case you are arguing that we should use one global value for all tests, I would simply define the constant in the test package and import it into any "child" package under test/.

Lastly, introducing some duplication is IMO much cheaper than adding an import between two otherwise-unrelated packages.

Member

In case I wasn't able to convince you with the above, please go ahead with whatever works for you. We need this change to get merged and I don't want to block it any further.

Contributor

FWIW I agree with @johananl here.

Member Author

I removed the constants.

// +build aws aws_edge baremetal packet
// +build e2e

package system_test
Member

I suggest we call this package system. It's already under test/ so test/system_test repeats "test" unnecessarily (see the "do not stutter" recommendation). Also, underscores or camel case are usually discouraged in the Go community. If you feel it's absolutely necessary to include "test" in the package name, I would consider systest or systemtest to make the package name more idiomatic. Relevant info: https://blog.golang.org/package-names

Member Author

This is to satisfy a new linter, which IMO we should also incorporate for other code: https://github.com/maratori/testpackage.

Member

Have we agreed on using this new linter? I'm not familiar with this tool. Anyway, if that's what this linter is suggesting, I'd say we need to reconsider using it :-/

Member Author

Generally speaking, we should be following best practices and this is one of them. The linter enforces them. If we don't agree to follow some, we make a team decision and we disable some linters. This allows writing more consistent code for everyone involved in the project.

In this particular case, using this package name for unit tests enforces testing only exported functions, which makes refactoring easier in the future. It does not make much sense for e2e tests, but I don't see a reason why we should make a distinction here.

Member

Generally speaking, we should be following best practices and this is one of them. The linter enforces them.

Right. The relevant Go best practice is here: https://blog.golang.org/package-names
Where is the best practice which says system_test is preferred? :-/

If we don't agree to follow some, we make a team decision and we disable some linters. This allows writing more consistent code for everyone involved in the project.

Following that logic, shouldn't we have a team decision to add a linter, too? In any case, IMO the official Go recommendations are much stronger than the opinion of a community member who wrote a linter. Do you think otherwise?

Contributor

I'd go without _test.

I see the point of the _test convention: it forces only exported functions to be tested (it's briefly mentioned here https://tip.golang.org/cmd/go/#hdr-Test_packages and it's used in the Go standard library).

In this case it doesn't really matter because we're in a separate package anyway and we're not doing unit tests.
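
To illustrate what the _test suffix does (a generic sketch, not code from this repository): for a file in package foo, a sibling foo_test.go can declare package foo_test, which compiles as a separate package and can only reach foo's exported identifiers:

package foo_test

import (
	"testing"

	"example.com/foo" // hypothetical module path
)

func TestExported(t *testing.T) {
	// foo.Exported() is reachable here; an unexported foo.helper() would
	// not compile from this external test package.
	if got := foo.Exported(); got != 42 {
		t.Fatalf("foo.Exported() = %d, want 42", got)
	}
}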

Member

Thanks @iaguis. I wasn't aware _test in package names was a thing, and couldn't figure this out from the discussion above. So IIUC _test is a special case, i.e. there is no recommendation to call a package cluster_configuration. Right?

Contributor

Yes, IIUC this is only for tests, not for package names in general.

Member Author

If I should go without _test (though I think it does not matter here and it's a good habit to keep it), what should I do with the linter? Do we want to disable the rule globally? Or should each new test file get //nolint:testpackage?

Member Author

Just added //nolint:testpackage for now.

@johananl
Member

Looks like the cleanup doesn't work - I still have the DS on the cluster following a failed test run. Could this be caused by the following?

t.Errorf("error while waiting for the daemonset: %v", err)

@johananl
Member

I've checked /home/core/.ssh/authorized_keys on the nodes. It contains the key specified by Ignition but also an empty line. This could be a hint.

Also, we can probably improve the test: right now we are determining pass/fail based on the convergence of the DS. This has the following problems:

  • In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.
  • We don't get meaningful output regarding the reason for the failure.

I suggest the following changes:

  • Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.
  • Perform the test by spawning the pod using a dummy command (e.g. sleep), then running cat /home/core/.ssh/authorized_keys interactively using the k8s client while collecting its output. Alternatively we could run a k8s job with the command in the manifest assuming we can gather the output later. This would allow us to run test logic in Go code and provide a meaningful error in case of failure; see the sketch below.
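
For what it's worth, a rough client-go sketch of the exec-based approach (a sketch only; the function and names are illustrative, not code from this PR):

package system_test

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// podExec runs a command in the given pod and returns its stdout, so the
// test can inspect /home/core/.ssh/authorized_keys and fail with a
// meaningful message.
func podExec(client kubernetes.Interface, config *restclient.Config, ns, pod, container string, command []string) (string, error) {
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", fmt.Errorf("creating executor: %w", err)
	}

	var stdout, stderr bytes.Buffer
	if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec: %v (stderr: %q)", err, stderr.String())
	}

	return stdout.String(), nil
}

Usage would be along the lines of podExec(client, config, "default", "test-pod", "test-ssh-keys", []string{"cat", "/home/core/.ssh/authorized_keys"}).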

@johananl
Member

Ah, the problem is that the "# auto-generated by update-ssh-keys" line doesn't exist in the authorized_keys file on disk. Is it supposed to be there?

@invidian
Member Author

Ah, the problem is that the "# auto-generated by update-ssh-keys" line doesn't exist in the authorized_keys file on disk. Is it supposed to be there?

Ah, with the unit disabled, we no longer get this line. I was modifying the file manually to make the test pass and didn't have time to check the CI results. I'll remove that line then.

Looks like the cleanup doesn't work - I still have the DS on the cluster following a failed test run. Could this be caused by the following?

IIRC the cleanup only takes place if the test succeeds, so you can investigate failures. Do you think we should always do the cleanup?

Regarding the failing tests, I think we must run on at least one worker and one controller node, as there are 2 knobs in the code responsible for this (#449). The easiest way to do that is via a DaemonSet. Unfortunately, a DaemonSet does not support restartPolicy: Never, which would allow determining the state quickly.

Regarding the reason for a failing test: given that we run tests on CI, I wouldn't print the content of authorized_keys in the test logs, as it may leak e.g. email addresses. I think just reporting a SHA512 mismatch and skipping the cleanup is reasonable here, so you can go and investigate.
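
For illustration, computing the expected digest on the Go side could look like this (a sketch; the function name is made up, and the argument stands for the expected authorized_keys content):

package system_test

import (
	"crypto/sha512"
	"fmt"
)

// expectedSum returns the hex SHA512 digest of the given authorized_keys
// content, in the format sha512sum prints, so test logs only ever contain
// the digest, never the key material itself.
func expectedSum(authorizedKeys string) string {
	return fmt.Sprintf("%x", sha512.Sum512([]byte(authorizedKeys)))
}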

Perform the test by spawning the pod using a dummy command (e.g. sleep), then running cat /home/core/.ssh/authorized_keys interactively using the k8s client while collecting its output. Alternatively we could run a k8s job with the command in the manifest assuming we can gather the output later. This would allow us to run test logic in Go code and provide a meaningful error in case of failure.

I wanted to avoid doing more interactive things, as they seem overly complicated to use in testing.

Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.

This would make sense, but then I need to iterate over all the nodes, create pods there etc.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch 2 times, most recently from 8048022 to 992b55b on May 27, 2020 13:02
@invidian
Member Author

In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.

Also, we should be optimizing for the passing case, not the failing one, as ideally the test should never fail. If it does fail, IMO it's fine for that to happen slowly.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 992b55b to ceb2e46 on May 27, 2020 13:39
@invidian
Member Author

Tests are passing, except for the flaky ARM job.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch 4 times, most recently from e459649 to e8df46e on May 28, 2020 09:20
@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from e8df46e to 859f9eb on May 28, 2020 10:58
@invidian invidian requested a review from johananl May 28, 2020 11:10
@surajssd
Member

Now all of them fail on bootkube 🤔

@johananl
Member

In case there is a SHA mismatch, we wait for 5 minutes rather than failing immediately.

Also, we should be optimizing for the passing case, not the failing one, as ideally the test should never fail. If it does fail, IMO it's fine for that to happen slowly.

I disagree. A slow test affects the run time of all CI builds. If we have an easy way to fail fast, I don't see why we should wait 5 minutes.

@johananl
Member

IIRC the cleanup only takes place if the test succeeds, so you can investigate failures. Do you think we should always do the cleanup?

I guess we can skip the cleanup on failure as long as doing so doesn't affect other tests.

@johananl
Member

Run an ad-hoc pod rather than a DS. We don't support per-node SSH keys so we can safely assume that if one node is in a good state, the rest of them are. If necessary, we could run a separate test for controller nodes.

This would make sense, but then I need to iterate over all the nodes, create pods there etc.

You won't have to iterate. AFAICT you only have to test one worker node and one controller node.

@invidian
Member Author

You won't have to iterate. AFAICT you only have to test one worker node and one controller node.

This means I need to find the node names and then run 2 independent, though very similar, tests. This seems fragile and unnecessarily complicated to me.

I disagree. A slow test affects the run time of all CI builds. If we have an easy way to fail fast, I don't see why we should wait 5 minutes.

But unless you break things, you won't notice it, so it will only affect a small percentage of the builds.

I guess we can skip the cleanup on failure as long as doing so doesn't affect other tests.

I don't think it should affect anything.

@johananl
Member

@invidian please go ahead with whatever makes sense to you regarding the test. My comments regarding the DaemonSet weren't meant as a "hard block". They were rather suggestions on how to make the test better in case you're still working on improving it.

@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 859f9eb to 6b6f3d8 on May 28, 2020 15:43
iaguis previously approved these changes May 29, 2020
Contributor

@iaguis iaguis left a comment

lgtm when green.

This commit disables syncing authorized SSH keys for the core user on
nodes from the Packet user's and project's keys, so that only the keys
specified in the Lokomotive configuration are actually authorized.
Having more keys authorized than the configuration specifies is a
potential security threat.

Closes #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit adds an e2e test to verify that no extra SSH keys are added
to the Kubernetes nodes other than the one specified in the SSH_KEY
environment variable, which is used by the CI for provisioning the
clusters.

Ideally, the test would pull the keys directly from the cluster
configuration; however, this is currently not trivial to implement.

Refs #465

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian invidian force-pushed the invidian/packet-disable-ssh-keys-integration branch from 6b6f3d8 to 72079dd on May 29, 2020 12:18
@invidian
Member Author

@iaguis tests are passing.

Contributor

@iaguis iaguis left a comment

lgtm

Member

@ipochi ipochi left a comment

lgtm, with a suggestion.

// Set the right arguments from the manifest with the desired SHA512 sum.
ds.Spec.Template.Spec.Containers[0].Args = []string{
	"-c",
	fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll
}
Member

Suggested change
fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll
fmt.Sprintf("sha512sum "+
	"--status -c <(echo %s /home/core/.ssh/authorized_keys) && "+
	"exec tail -f /dev/null", sum),

Would something like this be better, to avoid //nolint:lll?

Member Author

I'm not sure if that improves the readability 😄

@ipochi
Member

ipochi commented May 29, 2020

@invidian please re-trigger the job, the baremetal job failed for some weird reason.

@invidian
Member Author

@invidian please re-trigger the job, the baremetal job failed for some weird reason.

It's done, it runs here: https://yard.lokomotive-k8s.net/teams/main/pipelines/lokomotive-oss-bare-metal-pr/jobs/test-lokomotive-deployment/builds/611.1

@invidian invidian merged commit c5728c2 into master May 29, 2020
@invidian invidian deleted the invidian/packet-disable-ssh-keys-integration branch May 29, 2020 14:20

Successfully merging this pull request may close these issues.

Packet: handle automatic authorization of all API-level SSH keys