
Increase the comprehensiveness of check --pre #3701

Status: Open · wants to merge 1 commit into base: master
Conversation

tegioz (Collaborator) commented Nov 12, 2019

A new check called `can create non-namespaced resources` has been added to replace the individual `can create <RESOURCE>` checks related to non-namespaced resources.

Before, each of those `can create` checks was asking k8s (SelfSubjectAccessReview) whether the user was authorized to create the given resource. Now the non-namespaced resources are actually created, but using the DryRun option so that objects are not persisted. This way we should be able to catch more details about potential problems preventing the installation from proceeding successfully.

This check is fed from the installation manifest generated with default options. This means that, as new non-namespaced resources are added to the installation, this check will automatically test whether they can be created, without requiring additional modifications.
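As an illustration of how such a check could decide which objects from the generated manifest to feed into a dry-run create, here is a minimal Go sketch. The kind list and function names are assumptions for illustration, not the PR's actual code:

```go
package main

import "fmt"

// clusterScopedKinds is an illustrative subset of the kinds that appear in
// the linkerd install manifest and are not namespaced (hypothetical list).
var clusterScopedKinds = map[string]bool{
	"Namespace":                      true,
	"ClusterRole":                    true,
	"ClusterRoleBinding":             true,
	"CustomResourceDefinition":       true,
	"MutatingWebhookConfiguration":   true,
	"ValidatingWebhookConfiguration": true,
	"APIService":                     true,
	"PodSecurityPolicy":              true,
}

// isClusterScoped reports whether a manifest object of the given kind would
// be fed to the dry-run "can create non-namespaced resources" check.
func isClusterScoped(kind string) bool {
	return clusterScopedKinds[kind]
}

func main() {
	for _, k := range []string{"ClusterRole", "Deployment", "Namespace", "ConfigMap"} {
		fmt.Printf("%s cluster-scoped: %v\n", k, isClusterScoped(k))
	}
}
```

Because the set is derived from the manifest itself, newly added cluster-scoped kinds would be picked up without touching the check code.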

Closes #3224

Signed-off-by: Sergio Castaño Arteaga tegioz@icloud.com

[Screenshot 2019-11-13 at 10:43:00]

tegioz (Collaborator, Author) commented Nov 12, 2019

Hi @grampelberg

I've started addressing the points described in #3224 in this draft PR. Is something like this what you had in mind for the first part of the ticket?

Some problems/questions I'd like to discuss:

  • DryRun seems to require the target namespace to exist, so the check is creating it and deleting it on completion. I'm not very happy with this, as it's possible (just Ctrl-C while the check is running) that the namespace is not cleaned up. Similarly, deleting the namespace takes a few seconds, so running `check --pre` immediately after it finishes raises an issue as the namespace is already there. After a couple of seconds it goes away, but it's a problem. Need to think about a better and safer way of handling this.

  • Discrepancies between the `linkerd install` output used by the checks and the actual output of the command when run with custom options. I don't know how critical this is, but ideally it'd be great to have a chance to test the resulting install manifest that the user would like to apply, including all the configuration changes, and not just a generic one.

@tegioz tegioz added the area/cli label Nov 12, 2019
@tegioz tegioz self-assigned this Nov 12, 2019
grampelberg (Member) commented Nov 12, 2019

> DryRun seems to require the target namespace to exist, so the check is creating it and deleting it on completion. I'm not very happy with this as it's possible (just Ctrl-C while the check is running) that the namespace is not cleaned up. Similarly, deleting the namespace takes a few seconds, so running check --pre immediately after it finishes raises an issue as the namespace is already there. After a couple of seconds it goes away, but it's a problem. Need to think about a better and safer way of handling this.

So far, the CLI only reads things on your cluster. Having it create/delete a namespace definitely isn't fantastic. As it would be possible for someone to just `linkerd install | kubectl apply --server-dry-run`, maybe we should make this a little more focused? WDYT about splitting the resources into cluster level and namespace level? For cluster level, we do a dry run, and for namespace level we leave it the way it is today?

> Discrepancies about the linkerd install output used by the tests and the potential output of the command when run using some options. I don't know how critical this is, but ideally it'd be great to have a chance to test the resulting install manifest that user would like to apply, including all the configuration changes applied, and not just a generic one.

At this point I think we're just getting into the territory of having `apply --server-dry-run`. Can you think of some good ways to nudge users into running that before doing the apply? Maybe some output in `install` and `check --pre`?

@tegioz tegioz force-pushed the improve-check-pre branch from ce5d143 to e93311c Nov 13, 2019
tegioz (Collaborator, Author) commented Nov 13, 2019

> So far, the CLI only reads things on your cluster. Having it create/delete a namespace definitely isn't fantastic. As it would be possible for someone to just linkerd install | kubectl apply --server-dry-run, maybe we should make this a little more focused? WDYT about splitting the resources into cluster level and namespace level? For cluster level, we do a dry run, and for namespace level we leave it the way it is today?

That sounds good to me. I've just updated the PR to do that. I think this version adds some value without requiring any namespace creation/deletion. Thanks!

> At this point I think we're just getting into the territory of having apply --server-dry-run. Can you think of some good ways to nudge users into running that before doing the apply? Maybe some output in install and check --pre?

Yes, I agree, it'd be very similar to running `apply --server-dry-run` manually, with the advantage of having it integrated into check and being able to set it up properly so that it works (creating the required namespace). We have to keep in mind that running `apply --server-dry-run` manually also requires the namespace to exist so that namespaced resources can be created with dry-run. So I think we can't just recommend that users pipe the output of `install` into `apply --server-dry-run` without some extra setup.

zaharidichev (Member) commented Nov 13, 2019

> WDYT about splitting the resources into cluster level and namespace level? For cluster level, we do a dry run, and for namespace level we leave it the way it is today?

If we really want to test the namespaced resources using the same approach as in this PR, I think there are two options that I can come up with:

  1. Create a namespace that has a name that is randomly generated, thus lowering the probability of a namespace collision dramatically (while the delete takes place)
  2. Create the namespace and block until we verify it has been deleted successfully (not sure whether the overhead of that is problematic)

@grampelberg @tegioz WDYT?

tegioz (Collaborator, Author) commented Nov 13, 2019

Thanks @zaharidichev, we could improve it using any of those options, but I'm not sure if they would solve the problem completely.

> 1. Create a namespace that has a name that is randomly generated, thus lowering the probability of a namespace collision dramatically (while the delete takes place)

What if the user running the check has been granted permissions only in certain namespaces? The check would fail, but it might have succeeded if the provided namespace (`check --linkerd-namespace`) or the default one had been used. There could also be some admission webhooks applied only to certain namespaces. If possible, I think it'd be better to run this check using the namespace provided by the user.

> 2. Create the namespace and block until we verify it has been deleted successfully (not sure whether the overhead of that is problematic)

Blocking would help, but if the check is interrupted or killed the namespace would never be deleted. The interruption could be handled, but not the kill.

When creating the namespace for checking purposes, we could add some sort of check annotation on it. When running check or install we could try to check if the target (or any, actually) namespace exists with that annotation, and clean it up before doing any further processing. It's not very nice and it'd still be possible to leave that namespace in the cluster, but it would solve the problem partially too.
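The annotation idea described above could look something like this (a minimal Go sketch; the annotation key and function names are hypothetical, not from this PR):

```go
package main

import "fmt"

// checkAnnotation would mark namespaces created only for pre-install
// checking (hypothetical annotation key, not an actual Linkerd annotation).
const checkAnnotation = "linkerd.io/created-by-check"

// shouldCleanUp reports whether a leftover namespace was created by a
// previous check run and can safely be deleted before proceeding.
func shouldCleanUp(annotations map[string]string) bool {
	_, ok := annotations[checkAnnotation]
	return ok
}

func main() {
	leftover := map[string]string{checkAnnotation: "true"}
	userOwned := map[string]string{"team": "payments"}
	fmt.Println(shouldCleanUp(leftover))  // namespace left by an interrupted check
	fmt.Println(shouldCleanUp(userOwned)) // real user namespace: leave it alone
}
```

Before creating the check namespace, `check` or `install` would look for any namespace carrying this annotation and delete it first; a namespace without the annotation would never be touched.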

grampelberg (Member) commented Nov 13, 2019

To keep the scope of this specific PR down, I'd prefer to not create the namespace as that's a larger product decision.

It does sound valuable to have a tool that does that, though. I'm in the process of thinking through what a set of diagnostic tools would look like and I'll make sure to add this to the list! Any recommendations along those lines would be greatly appreciated =)

@tegioz tegioz added this to In Progress in L5D-Dashboard-EVOL via automation Nov 13, 2019
@tegioz tegioz marked this pull request as ready for review Nov 13, 2019
@tegioz tegioz moved this from In Progress to Pending Reviewer Approval in L5D-Dashboard-EVOL Nov 13, 2019
adleong (Member) left a comment

This is the output I get when I don't have the correct permissions:

pre-kubernetes-setup
--------------------
× control plane namespace does not already exist
    namespaces "linkerd" is forbidden: User "obama@buoyant.io" cannot get resource "namespaces" in API group "" in the namespace "linkerd"
    see https://linkerd.io/checks/#pre-ns for hints
× can create non-namespaced resources
    cannot create Namespace/linkerd: namespaces is forbidden: User "obama@buoyant.io" cannot create resource "namespaces" in API group "" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-identity: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-identity: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-controller: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-controller: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-destination: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-destination: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-web-admin: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create CustomResourceDefinition/serviceprofiles.linkerd.io: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
    cannot create CustomResourceDefinition/trafficsplits.split.smi-spec.io: customresourcedefinitions.apiextensions.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "customresourcedefinitions" in API group "apiextensions.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-prometheus: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-prometheus: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-proxy-injector: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-proxy-injector: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create MutatingWebhookConfiguration/linkerd-proxy-injector-webhook-config: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-sp-validator: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-sp-validator: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ValidatingWebhookConfiguration/linkerd-sp-validator-webhook-config: validatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-tap: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRole/linkerd-linkerd-tap-admin: clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-tap: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create ClusterRoleBinding/linkerd-linkerd-tap-auth-delegator: clusterrolebindings.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterrolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
    cannot create APIService/v1alpha1.tap.linkerd.io: apiservices.apiregistration.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "apiservices" in API group "apiregistration.k8s.io" at the cluster scope
    cannot create PodSecurityPolicy/linkerd-linkerd-control-plane: podsecuritypolicies.policy is forbidden: User "obama@buoyant.io" cannot create resource "podsecuritypolicies" in API group "policy" at the cluster scope
    see https://linkerd.io/checks/#pre-k8s-cluster-k8s for hints
× can create ServiceAccounts
    not authorized to access serviceaccounts: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can create Services
    not authorized to access services: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can create Deployments
    not authorized to access deployments.apps: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can create CronJobs
    not authorized to access cronjobs.batch: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can create ConfigMaps
    not authorized to access configmaps: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can create Secrets
    not authorized to access secrets: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× can read Secrets
    not authorized to access secrets: no RBAC policy matched
    see https://linkerd.io/checks/#pre-k8s for hints
× no clock skew detected
    nodes is forbidden: User "obama@buoyant.io" cannot list resource "nodes" in API group "" at the cluster scope
    see https://linkerd.io/checks/#pre-k8s-clock-skew for hints

This is a pretty scary wall of text, and a lot of the information there is highly redundant. Can you think of any way to present this in a more clear way?

Review comment on pkg/healthcheck/healthcheck.go (resolved)
adleong (Member) commented Nov 13, 2019

This actually looks worse in my terminal because of line wrapping and how wide the error messages are:

[Screen Shot 2019-11-13 at 2:06:10 PM]

tegioz (Collaborator, Author) commented Nov 14, 2019

> This is the output I get when I don't have the correct permissions: [same check output as quoted in full above]
>
> This is a pretty scary wall of text, and a lot of the information there is highly redundant. Can you think of any way to present this in a more clear way?

Hi @adleong

We could check the error of each of the cases in `can create non-namespaced resources` and print it using a nicer message. In your particular case, I think checking whether it's a forbidden error would suffice. We would be hiding the specific error and showing something like the `not authorized to create ...: no RBAC policy matched` message, I guess.
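The error-mapping idea could be sketched like this (hypothetical Go sketch; real code would inspect the typed Kubernetes StatusError rather than matching on the message string):

```go
package main

import (
	"fmt"
	"strings"
)

// friendlyError condenses a raw Kubernetes "forbidden" error into the short
// form used by the existing checks; any other error passes through unchanged
// so its details aren't lost.
func friendlyError(resource, rawErr string) string {
	if strings.Contains(rawErr, "is forbidden") {
		return fmt.Sprintf("not authorized to create %s: no RBAC policy matched", resource)
	}
	return rawErr
}

func main() {
	raw := `clusterroles.rbac.authorization.k8s.io is forbidden: User "obama@buoyant.io" cannot create resource "clusterroles"`
	fmt.Println(friendlyError("ClusterRole/linkerd-linkerd-tap", raw))
}
```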

Note that creating the object using dry-run is not a simple permissions check like the previous tests, so any other kind of error may be raised. We could catch some expected ones and print nicely formatted messages for them (like this one), but it's still possible that for some errors we'd display messages that aren't so nice. We could have a catch-all one too, but then we could be hiding some important details about the error.

I agree they don't look nice, but they can be useful to debug an issue. And in the end they are generic k8s errors users may already be familiar with. I also agree the output is highly redundant in this case, but it's not much different from the `can create` checks below: `not authorized to access ...: no RBAC policy matched`. Maybe we can play a bit with the `--verbose` flag and only display some messages when it is provided.

Happy to proceed either way :)

Related to this: do you think it could be useful to provide a reference YAML file with a role and cluster role including the permissions required to install Linkerd? In existing checks, when it's not possible to create a given resource, users are recommended to see https://linkerd.io/checks/#pre-k8s for hints. I was thinking that even though that documentation section gives a lot of useful information, if a cluster admin wanted to grant another user permissions to install Linkerd, it could be easier for them to inspect and apply them from a YAML file directly. Some resources are cluster level, some would be created in the namespace assigned to Linkerd, and it can be a bit of work to put them together. Maybe this already exists and I haven't seen it yet, though.
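A rough illustration of what such a reference file could contain (the name and rule list here are hypothetical, not an official Linkerd artifact; the real file would need to cover every kind in the install manifest):

```yaml
# Hypothetical reference: a subset of the cluster-level permissions needed
# to run linkerd install. An admin could review and apply this for a user.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: linkerd-installer
rules:
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles", "clusterrolebindings"]
    verbs: ["create", "get", "list"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create", "get", "list"]
```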

tegioz (Collaborator, Author) commented Nov 14, 2019

@adleong another solution could be to group similar errors omitting the concrete resource name. Let's say we get a forbidden error creating a ClusterRole. We would only print the forbidden error for that resource kind, omitting the name. If the same error happens again on a different object of the same resource kind, we don't print anything.

Personally I still prefer printing errors per specific resource, as I don't know if in some cases we could get errors on a resource of a given kind and not get them for another one of the same kind for whatever reason. Without the specific resource name in the error, it may be harder to know what's going on. And showing them grouped sometimes and sometimes not may be a bit inconsistent.
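The grouping alternative could be sketched as follows (hypothetical Go sketch of the idea under discussion, not the PR's implementation):

```go
package main

import "fmt"

// groupByKind collapses per-resource errors into one entry per resource
// kind, counting how many resources of that kind failed, so repeated
// forbidden errors print once instead of once per object.
func groupByKind(failedKinds []string) map[string]int {
	counts := map[string]int{}
	for _, kind := range failedKinds {
		counts[kind]++
	}
	return counts
}

func main() {
	kinds := []string{"ClusterRole", "ClusterRole", "ClusterRoleBinding", "APIService"}
	for kind, n := range groupByKind(kinds) {
		fmt.Printf("cannot create %s (%d resources)\n", kind, n)
	}
}
```

The trade-off raised above applies: the condensed form loses the specific resource names, which may matter when only some objects of a kind fail.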

@tegioz tegioz force-pushed the improve-check-pre branch 2 times, most recently from e63c4d4 to 9d6148a Nov 15, 2019
adleong (Member) commented Nov 25, 2019

After thinking about this some more, I have the same philosophical objection here as I do in #3708

Our current philosophy is that the CLI never directly initiates a write operation, and instead leaves all write operations for the user to invoke with kubectl. Even with the dry-run flag, I think this is a violation of that assumption.

I think that assumption is worth re-visiting and re-evaluating if there is a compelling scenario that requires it, but I don't feel like an incremental improvement to check is that.

tegioz (Collaborator, Author) commented Nov 27, 2019

I understand, makes sense. Should I close this PR and #3708 for now?

@grampelberg grampelberg self-assigned this Dec 2, 2019
adleong (Member) commented Dec 3, 2019

I think for both #3708 and this PR, we first need to separate checks into two categories:

  • checks which perform no write operations of any kind (even dry-run writes), require no RBAC, and are safe to perform in the dashboard
  • checks which do require RBAC or perform dry-run writes

This allows us to add a flag which lets users running linkerd check opt out of the first category of checks, and allows us to turn off those checks when run by the dashboard. I am okay with the dry-run checks added here and in #3708 if they are put into that second category.

tegioz (Collaborator, Author) commented Dec 3, 2019

Awesome, happy to make any required changes to make them fit :)

However, please note that some of the current checks already require RBAC (although only get/list permissions). We had to add some rules to make them work from the dashboard, as it uses a service account with limited privileges. The same applies to the CLI if the user running linkerd check doesn't have the required permissions.

adleong (Member) commented Dec 3, 2019

@tegioz ah, you're right. I should have said that the first category of checks should require no write RBAC. get/list is okay in my opinion.

@tegioz tegioz moved this from Pending Reviewer Approval to In Progress in L5D-Dashboard-EVOL Dec 4, 2019
@tegioz tegioz force-pushed the improve-check-pre branch from 9d6148a to c8cc37c Dec 4, 2019
@grampelberg grampelberg assigned adleong and unassigned tegioz Dec 9, 2019
@grampelberg grampelberg assigned adleong and unassigned adleong and grampelberg Dec 9, 2019
adleong (Member) left a comment

This looks great! I wonder if we should delete all of the pre-linkerd-global-resources checks now, since the `can create non-namespaced resources` check will now fail if those resources already exist.

adleong (Member) commented Dec 10, 2019

What is the plan for users without the dry-run feature? I know we discussed this a bit on #3708 but I'm not sure what conclusion we reached.

I think the ideal behavior would be to catch and handle the `dryRun alpha feature is disabled` errors and, if we get that error, either fall back to the previous behavior with SelfSubjectAccessReviews or print a message saying that the check was skipped because dryRun was not available.

grampelberg (Member) commented Dec 10, 2019

@adleong what's the version support of dryRun? My impression was that it was on by default in 1.12, which is our minimum supported version right now. Given the k8s version support matrix, I'd be good with bumping the minimum version to 1.14 at this point, as 1.17 just went out.

adleong (Member) commented Dec 10, 2019

I believe it is alpha in 1.12 and beta (and on by default) in 1.13. My dev cluster is 1.12 and doesn't have it.

As a comment that's not specific to Linkerd: the Kubernetes deprecation cycle feels insanely fast to me. Are folks really expected to upgrade their Kubernetes version every year?

tegioz (Collaborator, Author) commented Dec 11, 2019

This looks great! I wonder if we should delete all of the pre-linkerd-global-resources checks now, since the can create non-namespaced resources check will now fail if those resources already exist.

That's a good point, I think they may not be needed anymore 👍

tegioz (Collaborator, Author) commented Dec 11, 2019

A quick summary of the status of this PR just in case you guys still want to move forward with it :)

When I first created it I mentioned a couple of problems to be discussed. I'll focus on the first one as the second one was a matter of choice and could probably be solved cleanly.

  • DryRun seems to require the target namespace to exist, so the check is creating it and deleting it on completion. I'm not very happy with this as it's possible (just Ctrl-C while the check is running) that the namespace is not cleaned up. Similarly, deleting the namespace takes a few seconds, so running check --pre immediately after it finishes raises an issue as the namespace is already there. After a couple of seconds it goes away, but it's a problem. Need to think about a better and safer way of handling this.

A possible solution would be something like this (proposed during the discussion along with some others):

When creating the namespace for checking purposes, we could add some sort of check annotation on it. When running check or install we could try to check if the target (or any, actually) namespace exists with that annotation, and clean it up before doing any further processing. It's not very nice and it'd still be possible to leave that namespace in the cluster, but it would solve the problem partially too.
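The annotation-based cleanup described above could be sketched as follows, using plain structs in place of client-go's corev1.Namespace. The annotation key is a hypothetical placeholder, since none was agreed on in this discussion:

```go
package main

import "fmt"

// namespace is a minimal stand-in for a corev1.Namespace.
type namespace struct {
	Name        string
	Annotations map[string]string
}

// checkAnnotation is a hypothetical marker that `check --pre` would set on
// the namespace it creates; no real key name was settled in this thread.
const checkAnnotation = "linkerd.io/created-by-check"

// staleCheckNamespaces returns the namespaces that an interrupted check run
// left behind, so check or install can clean them up before proceeding.
func staleCheckNamespaces(nss []namespace) []string {
	var stale []string
	for _, ns := range nss {
		if _, ok := ns.Annotations[checkAnnotation]; ok {
			stale = append(stale, ns.Name)
		}
	}
	return stale
}

func main() {
	nss := []namespace{
		{Name: "default"},
		{Name: "linkerd", Annotations: map[string]string{checkAnnotation: "true"}},
	}
	fmt.Println(staleCheckNamespaces(nss)) // [linkerd]
}
```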

However, @grampelberg had some concerns about creating the namespace.

To keep the scope of this specific PR down, I'd prefer to not create the namespace as that's a larger product decision.

At the moment, I think this is the main blocker for this PR, as we need to create the namespace to proceed with this new check. I understand the reasoning behind not creating it, as it's not very clean and feels intrusive, even if only temporary. So unless we change our mind about this, we'll probably need to close this PR.

In addition to this, @adleong had some concerns about whether checks should be performing any dry-run writes. He suggested separating the checks into different categories based on their permissions requirements, which has been addressed in #3708 (which also relies on dry-run). I started with that PR because I felt it was closer to being merged, as it doesn't have the namespace limitation discussed above.

I think this covers the most important details; I hope it helps. Please let me know how you would like to proceed. Thanks!

grampelberg (Member) commented Dec 11, 2019

@tegioz could we do the dry run against default? What are the potential downsides to that?

tegioz (Collaborator, Author) commented Dec 11, 2019

Yes, we could do that. The user running the check might have only been granted privileges to create the resources in the linkerd namespace, though, making the check fail even though the real installation might work. So the check wouldn't be as reliable as if we were using the final namespace. But it's definitely a valid option.

grampelberg (Member) commented Dec 11, 2019

Hmmm, all good points. I'd be a little worried about false positives. Here's an alternative - WDYT about doing the dry run only for cluster level resources? Those are the most likely to have issues in my experience.
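Limiting the dry run to cluster-level resources amounts to filtering the generated manifest down to non-namespaced kinds. A sketch of that filtering; the hard-coded kind list here is an illustrative subset, whereas the real check would derive each resource's scope from the manifest and API discovery:

```go
package main

import "fmt"

// clusterScopedKinds is an illustrative subset of non-namespaced kinds
// that can appear in an install manifest; a real implementation would
// determine scope via API discovery instead of a hard-coded table.
var clusterScopedKinds = map[string]bool{
	"Namespace":                      true,
	"ClusterRole":                    true,
	"ClusterRoleBinding":             true,
	"CustomResourceDefinition":       true,
	"MutatingWebhookConfiguration":   true,
	"ValidatingWebhookConfiguration": true,
}

// clusterScoped filters a manifest's kinds down to those that should be
// exercised with a server-side dry-run create, skipping namespaced ones.
func clusterScoped(kinds []string) []string {
	var out []string
	for _, k := range kinds {
		if clusterScopedKinds[k] {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	manifest := []string{"ClusterRole", "ServiceAccount", "Deployment", "CustomResourceDefinition"}
	fmt.Println(clusterScoped(manifest)) // [ClusterRole CustomResourceDefinition]
}
```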

tegioz (Collaborator, Author) commented Dec 11, 2019

I had already made that change after your initial comment, @grampelberg, but forgot to mention it. Sorry, my bad. The current version of the code in the PR already skips namespaced resources 🤦‍♂.

adleong (Member) commented Dec 11, 2019

My only remaining concern on this PR is that we decide how to handle clusters without dry run.

I think that means either bumping our minimum version or handling the errors when dry run is not available.

grampelberg (Member) commented Dec 11, 2019

We already have issues on clusters without dry run (from another PR months ago), so I'm +1 on bumping the minimum version to 1.13.

tegioz (Collaborator, Author) commented Dec 11, 2019

This looks great! I wonder if we should delete all of the pre-linkerd-global-resources checks now, since the can create non-namespaced resources check will now fail if those resources already exist.

@adleong I was checking LinkerdPreInstallGlobalResourcesChecks to see if they could be removed as suggested, but I've realised they are also called when running linkerd install, to determine whether the install config stage succeeded, not only from linkerd check.

tegioz (Collaborator, Author) commented Dec 12, 2019

If in the end we are happy using dry-run without splitting the tests into different categories, I can revert the commit I recently added to #3708.

adleong (Member) commented Dec 12, 2019

@tegioz I think we can simply remove LinkerdPreInstallGlobalResourcesChecks from the list of checks in check.go. This will mean these checks won't be used by linkerd check but can still be used by linkerd install.

Ignoring #3708 for a moment and just looking at this PR in isolation, I think the categories are good as they are here, because the dryRun check is in the LinkerdPreInstallChecks category, which is only run by the linkerd check --pre CLI command (and never run by the web dashboard).

@tegioz tegioz force-pushed the improve-check-pre branch from c8cc37c to 0c8ff49 Dec 12, 2019
Closes #3224

Signed-off-by: Sergio Castaño Arteaga <tegioz@icloud.com>
@tegioz tegioz force-pushed the improve-check-pre branch from 0c8ff49 to 3d1d0cd Dec 12, 2019
tegioz (Collaborator, Author) commented Dec 12, 2019

@tegioz I think we can simply remove LinkerdPreInstallGlobalResourcesChecks from the list of checks in check.go. This will mean these checks won't be used by linkerd check but can still be used by linkerd install.

Done, thanks! I've also rebased from master and fixed a method call that had changed since the PR was created.

Ignoring #3708 for a moment and just looking at this PR in isolation, I think the categories are good as they are here because the dryRun check is in the LinkerdPreInstallChecks category which is only run by the linkerd check --pre CLI command (and not ever run by the web dashboard)

👍
