
Add namespace status conditions #73405

Merged

Conversation

@wozniakjan (Contributor) commented Jan 28, 2019

What type of PR is this?
/kind feature
/sig api-machinery

What this PR does / why we need it:
So that users can understand what is preventing namespace deletion and do not remove the kubernetes finalizer from namespaces: #64002, #60807, #66735, #72244

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #70916

Special notes for your reviewer:
This is a parallel effort to #71457, the issue #70916 discusses both implementations.

Sample output

$ kubectl get -o yaml ns test
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2019-02-13T12:54:40Z"
  deletionTimestamp: "2019-02-13T12:55:11Z"
  name: test
  resourceVersion: "1150486"
  selfLink: /api/v1/namespaces/test
  uid: 870e9db8-2f8e-11e9-9c34-08002711add2
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2019-02-13T12:58:03Z"
    message: All content successfully deleted
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2019-02-13T12:55:16Z"
    message: 'Discovery failed for some groups, 2 failing: unable to retrieve the
      complete list of server APIs: mutators.kubedb.com/v1alpha1: the server is currently
      unable to handle the request, validators.kubedb.com/v1alpha1: the server is
      currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  phase: Terminating
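
For reference, a minimal Go sketch of the condition shape implied by the sample output above; the field set and type names here are inferred from the YAML, not copied from the merged API:

package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type NamespaceConditionType string

const (
	// condition types appearing in the sample output above
	NamespaceDeletionDiscoveryFailure NamespaceConditionType = "NamespaceDeletionDiscoveryFailure"
	NamespaceDeletionContentFailure   NamespaceConditionType = "NamespaceDeletionContentFailure"
)

// NamespaceCondition mirrors the fields serialized under status.conditions.
type NamespaceCondition struct {
	Type               NamespaceConditionType
	Status             ConditionStatus // "True", "False", or "Unknown"
	LastTransitionTime metav1.Time
	Reason             string
	Message            string
}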

Does this PR introduce a user-facing change?:

Add status condition to namespace resource

TODO:

@k8s-ci-robot (Contributor) commented Jan 28, 2019

Hi @wozniakjan. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@roycaihw (Member) commented Jan 31, 2019

/cc @lavalamp

@k8s-ci-robot requested a review from lavalamp Jan 31, 2019

@derekwaynecarr (Member) commented Jan 31, 2019

I agree with improving transparency on why a namespace is stuck terminating.

Can we ensure the design documentation for namespaces is kept in sync:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/namespaces.md

In practice it would be good to have a listing of which resources are not fetchable, so an admin could choose to manually remove the finalizer safely if required. It's not clear from the existing PR whether, if you had multiple aggregated APIs not responding, the condition gives you a safe enough response.

Review thread on pkg/apis/core/types.go (outdated, resolved)
@wozniakjan (Contributor, Author) commented Feb 4, 2019

Can we ensure the design documentation for namespaces is kept in sync:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/namespaces.md

Thanks @derekwaynecarr, I will make a PR to update the docs and keep them in sync with this PR.

It's not clear from the existing PR whether, if you had multiple aggregated APIs not responding, the condition gives you a safe enough response.

My goal was to transparently reflect the deleter error in the status condition. I'm not sure I understand your comment; would you like the status condition to provide more detailed feedback to the end user?

@lavalamp (Member) left a comment
Some initial comments

I'm really excited about getting visibility into the namespace lifecycle controller!

@wozniakjan force-pushed the wozniakjan:namespace_status_conditions branch from c43eddd to 133761c Feb 13, 2019
@k8s-ci-robot added the size/L label Feb 13, 2019
@wozniakjan (Contributor, Author) commented Feb 13, 2019

@lavalamp I pushed some code trying to address the initial round of comments; here is a quick digest of the conversation above.

Is a condition really the best way of phrasing an estimate?

Probably not, I dropped it from the status conditions. I think the estimate is only for pods:

// only pods get a graceful-termination estimate; all other
// resources fall through with no estimate
switch groupResource {
case schema.GroupResource{Group: "", Resource: "pods"}:
	estimate, err = d.estimateGracefulTerminationForPods(ns)
}

and the default graceful termination should be 30s, so it could be acceptable to omit this feedback to the end user for now and address the estimates in a later PR.

I recommend not having this function and figuring out the conditions you want before even calling updateStatusCondition

I like your recommendation and tried to implement it in the newly added NamespaceConditionUpdater, which should be responsible for handling namespace conditions related to namespace deletion (currently there are no other conditions). A rough sketch of the idea follows.
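
Illustrative sketch only; the method names below are assumptions about the shape of such an updater, not the merged code:

package deletion

import v1 "k8s.io/api/core/v1"

// NamespaceConditionUpdater accumulates deletion errors during a sync
// pass and merges the resulting conditions into the namespace status.
type NamespaceConditionUpdater interface {
	// ProcessDiscoverResourcesErr records a discovery failure for later
	// conversion into a NamespaceDeletionDiscoveryFailure condition.
	ProcessDiscoverResourcesErr(err error)
	// ProcessDeleteContentErr records an error hit while deleting content,
	// feeding the NamespaceDeletionContentFailure condition.
	ProcessDeleteContentErr(err error)
	// Update merges the accumulated conditions into ns.Status.Conditions
	// and reports whether the status changed and needs a write-back.
	Update(ns *v1.Namespace) bool
}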

it'd be good if the error message called out the number of things that can't be deleted. It should also probably mention e.g. the alphabetically first one.

That sounds very reasonable; added. I also split the condition into three based on the deleter error types, to improve UX when multiple things fail.

Old example of status conditions
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2019-02-12T08:49:30Z"
  deletionTimestamp: "2019-02-12T08:51:05Z"
  name: test3
  resourceVersion: "1045153"
  selfLink: /api/v1/namespaces/test3
  uid: 1cc27c77-2ea3-11e9-8c82-08002711add2
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastProbeTime: "2019-02-12T12:37:47Z"
    lastTransitionTime: "2019-02-12T08:51:14Z"
    message: 'There are remaining resources, deletion error: [unable to retrieve the
      complete list of server APIs: mutators.kubedb.com/v1alpha1: the server is currently
      unable to handle the request, validators.kubedb.com/v1alpha1: the server is
      currently unable to handle the request, Internal error occurred: failed calling
      webhook "mysql.validators.kubedb.com": the server is currently unable to handle
      the request, Internal error occurred: failed calling webhook "redis.validators.kubedb.com":
      the server is currently unable to handle the request, Internal error occurred:
      failed calling webhook "elasticsearch.validators.kubedb.com": the server is
      currently unable to handle the request, Internal error occurred: failed calling
      webhook "dormantdatabase.validators.kubedb.com": the server is currently unable
      to handle the request, Internal error occurred: failed calling webhook "memcached.validators.kubedb.com":
      the server is currently unable to handle the request, Internal error occurred:
      failed calling webhook "mongodb.validators.kubedb.com": the server is currently
      unable to handle the request, Internal error occurred: failed calling webhook
      "postgres.validators.kubedb.com": the server is currently unable to handle the
      request, Internal error occurred: failed calling webhook "etcd.validators.kubedb.com":
      the server is currently unable to handle the request]'
    reason: DeletionFailed
    status: "True"
    type: DeletionFailure
  phase: Terminating
New example of status conditions
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2019-02-13T12:54:40Z"
  deletionTimestamp: "2019-02-13T12:55:11Z"
  name: test
  resourceVersion: "1150211"
  selfLink: /api/v1/namespaces/test
  uid: 870e9db8-2f8e-11e9-9c34-08002711add2
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastProbeTime: "2019-02-13T12:55:16Z"
    lastTransitionTime: "2019-02-13T12:55:16Z"
    message: 'Failed to delete all resource types, 8 remaining: Internal error occurred:
      failed calling webhook "dormantdatabase.validators.kubedb.com": the server is
      currently unable to handle the request, Internal error occurred: failed calling
      webhook "elasticsearch.validators.kubedb.com": the server is currently unable
      to handle the request, Internal error occurred: failed calling webhook "etcd.validators.kubedb.com":
      the server is currently unable to handle the request, Internal error occurred:
      failed calling webhook "memcached.validators.kubedb.com": the server is currently
      unable to handle the request, Internal error occurred: failed calling webhook
      "mongodb.validators.kubedb.com": the server is currently unable to handle the
      request, Internal error occurred: failed calling webhook "mysql.validators.kubedb.com":
      the server is currently unable to handle the request, Internal error occurred:
      failed calling webhook "postgres.validators.kubedb.com": the server is currently
      unable to handle the request, Internal error occurred: failed calling webhook
      "redis.validators.kubedb.com": the server is currently unable to handle the
      request'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastProbeTime: "2019-02-13T12:55:16Z"
    lastTransitionTime: "2019-02-13T12:55:16Z"
    message: 'Discovery failed for some groups, 2 failing: unable to retrieve the
      complete list of server APIs: mutators.kubedb.com/v1alpha1: the server is currently
      unable to handle the request, validators.kubedb.com/v1alpha1: the server is
      currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  phase: Terminating

You need a code path that will remove the error condition if it eventually successfully deletes the thing, I think?

Absolutely. I looked at other controllers and they seem to invalidate the condition by setting Status to v1.ConditionFalse together with an appropriate message, reason, and transition timestamp. So that is what I ended up doing, roughly as sketched below.
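
A minimal sketch of that invalidation (assumed helper, not the merged code); the reason and message strings match the example that follows:

package deletion

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markContentDeleted flips the content-failure condition to False once
// all namespace content has been deleted successfully.
func markContentDeleted(c *v1.NamespaceCondition) {
	if c.Status != v1.ConditionFalse {
		// bump the transition time only on an actual status change
		c.LastTransitionTime = metav1.Now()
	}
	c.Status = v1.ConditionFalse
	c.Reason = "ContentDeleted"
	c.Message = "All content successfully deleted"
}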

Example of updated condition
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2019-02-13T12:54:40Z"
  deletionTimestamp: "2019-02-13T12:55:11Z"
  name: test
  resourceVersion: "1150486"
  selfLink: /api/v1/namespaces/test
  uid: 870e9db8-2f8e-11e9-9c34-08002711add2
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastProbeTime: "2019-02-13T12:55:16Z"
    lastTransitionTime: "2019-02-13T12:58:03Z"
    message: All content successfully deleted
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastProbeTime: "2019-02-13T12:55:16Z"
    lastTransitionTime: "2019-02-13T12:55:16Z"
    message: 'Discovery failed for some groups, 2 failing: unable to retrieve the
      complete list of server APIs: mutators.kubedb.com/v1alpha1: the server is currently
      unable to handle the request, validators.kubedb.com/v1alpha1: the server is
      currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  phase: Terminating

This always updates due to the probe time. This deserves a comment here on why we don't call DeepEqual on old and new status.

And I kept the earlier-implemented check to only update a condition's status if the message for that particular condition changes; a sketch of that comparison is below.
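
A sketch of that comparison (hypothetical helper): ignore the probe timestamp so a bare lastProbeTime refresh doesn't force a status write:

package deletion

import v1 "k8s.io/api/core/v1"

// conditionChanged ignores probe timestamps, which differ on every sync
// and would otherwise force a status update each pass.
func conditionChanged(old, cur v1.NamespaceCondition) bool {
	return old.Status != cur.Status ||
		old.Reason != cur.Reason ||
		old.Message != cur.Message
}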

Looking forward to the next round of feedback. If you find the design acceptable, I would like to continue with unit and e2e tests, and finally propose documentation changes.

@xmudrii (Member) commented Aug 28, 2019

As this PR is targeting 1.16, I'm adding it to the milestone
/milestone v1.16

@k8s-ci-robot added this to the v1.16 milestone Aug 28, 2019
@krmayankk (Contributor) commented Aug 29, 2019

Sorry, it's not clear from the sample output which objects in the namespace are preventing the deletion of the namespace, or is that just a positive case?

In the sample output, it is actually the discovery that was failing. The discovery phase is a prerequisite to the phase where the deleter starts deleting objects. The deleter algorithm is written in a best-effort way: it tries to delete whatever it can and then reports all the errors it encountered along the way, as sketched below.
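
A minimal sketch of that best-effort pattern; deleteCollection is a hypothetical stand-in for the real per-resource delete call, and the aggregate helper is k8s.io/apimachinery's utilerrors:

package deletion

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// deleteCollection is a hypothetical stand-in for the real per-resource
// delete call.
func deleteCollection(gvr schema.GroupVersionResource) error { return nil }

// deleteAllContent attempts every group/resource and collects failures
// instead of stopping at the first error.
func deleteAllContent(gvrs []schema.GroupVersionResource) error {
	var errs []error
	for _, gvr := range gvrs {
		if err := deleteCollection(gvr); err != nil {
			errs = append(errs, err) // keep going; best effort
		}
	}
	return utilerrors.NewAggregate(errs) // nil when errs is empty
}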

The way I understand this PR is that it injects additional conditions that will tell us what is preventing the deletion?

  - lastTransitionTime: "2019-02-13T12:55:16Z"
    message: 'Discovery failed for some groups, 2 failing: unable to retrieve the
      complete list of server APIs: mutators.kubedb.com/v1alpha1: the server is currently
      unable to handle the request, validators.kubedb.com/v1alpha1: the server is
      currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure

This would be the status condition telling you that discovery failed for two groups, mutators.kubedb.com/v1alpha1 and validators.kubedb.com/v1alpha1. Does this help?

Thanks @wozniakjan. In the doc bug, it would help to actually show the deletion failure case. Thanks for fixing this.

@fejta-bot commented Aug 29, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@xmudrii (Member) commented Aug 29, 2019

@wozniakjan It seems like this PR needs to be rebased before it can be merged.

wozniakjan added 3 commits May 14, 2019
make generated_files UPDATE_API_KNOWN_VIOLATIONS=true
./hack/update-generated-protobuf.sh
./hack/update-openapi-spec.sh
./hack/update-bazel.sh
./hack/update-generated-swagger-docs.sh
./hack/update-generated-api-compatibility-data.sh
@wozniakjan (Contributor, Author) commented Aug 29, 2019

Thanks @xmudrii, rebasing on the latest master to resolve conflicts with #79386 in staging/src/k8s.io/api/core/v1/generated.pb.go and updating generated files.

@wozniakjan force-pushed the wozniakjan:namespace_status_conditions branch from 365ce99 to b0459fe Aug 29, 2019
@k8s-ci-robot removed the lgtm label Aug 29, 2019
@deads2k (Contributor) commented Aug 29, 2019

/lgtm

@k8s-ci-robot added the lgtm label Aug 29, 2019
@wozniakjan (Contributor, Author) commented Aug 29, 2019

/test pull-kubernetes-integration

@wozniakjan (Contributor, Author) commented Aug 29, 2019

/test pull-kubernetes-bazel-test

@wozniakjan (Contributor, Author) commented Aug 29, 2019

/test pull-kubernetes-e2e-gce-100-performance

@xmudrii (Member) commented Aug 29, 2019

/retest

@lavalamp (Member) commented Aug 29, 2019

/test pull-kubernetes-bazel-build
/test pull-kubernetes-bazel-test

@k8s-ci-robot merged commit efaacf7 into kubernetes:master Aug 29, 2019

21 of 24 checks passed:
- pull-kubernetes-bazel-build: Job triggered.
- pull-kubernetes-bazel-test: Job triggered.
- tide: Not mergeable. Job pull-kubernetes-bazel-build has not succeeded.
- cla/linuxfoundation: wozniakjan authorized.
- pull-kubernetes-conformance-image-test: Skipped.
- pull-kubernetes-conformance-kind-ipv6: Skipped.
- pull-kubernetes-cross: Skipped.
- pull-kubernetes-dependencies: Job succeeded.
- pull-kubernetes-e2e-gce: Job succeeded.
- pull-kubernetes-e2e-gce-100-performance: Job succeeded.
- pull-kubernetes-e2e-gce-csi-serial: Skipped.
- pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
- pull-kubernetes-e2e-gce-iscsi: Skipped.
- pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
- pull-kubernetes-e2e-gce-storage-slow: Skipped.
- pull-kubernetes-godeps: Skipped.
- pull-kubernetes-integration: Job succeeded.
- pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
- pull-kubernetes-local-e2e: Skipped.
- pull-kubernetes-node-e2e: Job succeeded.
- pull-kubernetes-node-e2e-containerd: Job succeeded.
- pull-kubernetes-typecheck: Job succeeded.
- pull-kubernetes-verify: Job succeeded.
- pull-publishing-bot-validate: Skipped.
@tariq1890 (Member) commented Aug 29, 2019

🎉 🎉

@liggitt (Member) commented Aug 29, 2019

The bazel failure was a flake: #82150

@wozniakjan (Contributor, Author) commented Aug 30, 2019

Thank you everyone for the reviews, suggestions, and improvements, and for getting this in.
I will try to produce a matching docs change over the weekend and then look at the changes for kube-state-metrics.
