
[flaky test] Garbage collector timeout tests #87668

Closed
liggitt opened this issue Jan 29, 2020 · 13 comments · Fixed by #88382
Labels: kind/flake · priority/important-soon · sig/api-machinery

Comments

liggitt commented Jan 29, 2020

Which jobs are flaking:
pull-kubernetes-e2e-gce

Which test(s) are flaking:

  • [sig-api-machinery] Garbage collector should not be blocked by dependency circle [Conformance]
  • [sig-api-machinery] Garbage collector should orphan RS created by deployment when deleteOptions.PropagationPolicy is Orphan [Conformance]
  • [sig-api-machinery] Garbage collector should support cascading deletion of custom resources
  • [sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Conformance]
  • [sig-api-machinery] Garbage collector should delete RS created by deployment when not orphaning [Conformance]
  • [sig-api-machinery] Garbage collector should delete pods created by rc when not orphaning [Conformance]
  • [sig-api-machinery] Garbage collector should delete jobs and pods created by cronjob

Testgrid link:
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=Garbage&width=10

Reason for failure:
timeout waiting for deletion to complete

Anything else we need to know:
seems to have gotten much worse on/after 1/24/2020:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24&test=Garbage

See the time period on 2/11 for lots of examples

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050

number 2 flake according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json

/sig api-machinery

liggitt added the kind/flake label Jan 29, 2020
k8s-ci-robot added the sig/api-machinery label Jan 29, 2020
liggitt changed the title from "Garbage collector should not be blocked by dependency circle" to "[flaky test] Garbage collector should not be blocked by dependency circle" Jan 29, 2020
liggitt commented Jan 29, 2020

from the failure in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050:

gc events related to the test:

I0124 12:31:59.328882       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:31:59.333422       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:31:59.333442       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:32:12.256961       1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] to attemptToDelete, because its owner [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848] is deletingDependents
I0124 12:32:12.257019       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:32:26.585144       1 garbagecollector.go:459] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"pod3", UID:"d8e5f600-770e-4476-b421-217a128dc2df", Controller:(*bool)(0xc006325f3a), BlockOwnerDeletion:(*bool)(0xc006325f3b)}, Namespace:"gc-7066"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"pod2", UID:"e6d3b93e-2d0c-44f3-9606-4d45e0563ca7", Controller:(*bool)(0xc00335522a), BlockOwnerDeletion:(*bool)(0xc00335522b)}}, will not garbage collect
I0124 12:32:26.896543       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:32:28.152245       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:32:28.869366       1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8973-crds], removed: []
I0124 12:32:43.246095       1 graph_builder.go:222] synced monitors; added 1, kept 51, removed 0
I0124 12:32:43.246138       1 garbagecollector.go:221] resynced monitors
I0124 12:32:53.454914       1 garbagecollector.go:241] synced garbage collector
I0124 12:32:53.455201       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:32:53.455218       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:04.486694       1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] to attemptToDelete, because its owner [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] is deletingDependents
I0124 12:33:05.020213       1 garbagecollector.go:482] processing object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df], some of its owners and its dependent [[v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground
I0124 12:33:05.338836       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:33:05.834126       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:06.637367       1 garbagecollector.go:482] processing object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df], some of its owners and its dependent [[v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground
I0124 12:33:07.352110       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:33:12.389962       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:33:12.389983       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:12.390499       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:33:12.394994       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:12.600109       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:13.182483       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:13.196814       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:33:13.571769       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]

test timed out:

12:33:28.425: failed to wait for all pods to be deleted: timed out waiting for the condition

gc events related to the test, after the timeout:

I0124 12:33:28.651239       1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-common-group.example.com/v4, Resource=e2e-test-crd-publish-openapi-1403-crds crd-publish-openapi-test-common-group.example.com/v5, Resource=e2e-test-crd-publish-openapi-3515-crds], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8973-crds]
I0124 12:33:40.958391       1 graph_builder.go:222] synced monitors; added 2, kept 51, removed 1
I0124 12:33:40.958472       1 garbagecollector.go:221] resynced monitors
I0124 12:33:43.518300       1 garbagecollector.go:241] synced garbage collector
I0124 12:33:43.518311       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:44.311382       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]

liggitt commented Jan 29, 2020

cc @caesarxuchao

fedebongio commented

/assign @liggitt
thank you for fixing it!

liggitt added the priority/critical-urgent label Feb 8, 2020
liggitt changed the title from "[flaky test] Garbage collector should not be blocked by dependency circle" to "[flaky test] Garbage collector should not be blocked by dependency circle [Conformance]" Feb 8, 2020
liggitt commented Feb 8, 2020

looking at the debug logs from the last several failures, filtering remaining pods to name+finalizers+ownerReferences with:

jq '[.[] | {"metadata":{"name":.metadata.name,"finalizers":(.metadata.finalizers // []), "ownerReferences": .metadata.ownerReferences}}]'

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226216133338599424:

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "594ead6c-1e42-45aa-a5b8-6f3306cd464f",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "4d2a5407-e124-4056-8ae5-514abdf13115",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226199409381871617:

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "b6ffe0f4-6b9c-4ed7-a0c8-b9f6e96cd5de",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226179642994462720

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "26f20e5b-8d91-45a8-90f3-3de326cbff50",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod2",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod1",
          "uid": "af8bf19e-0860-4749-8aca-f3ee6f0d8702",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "73d4a2e1-16a4-42fc-a2c8-2e43a5fa5f47",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226179511796633600

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "dfa0ad4c-7a68-48dc-9cc8-60a3e85e9f14",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226155482490605569

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "42b0f0c3-0e9e-460a-9300-007b58a69386",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226128057127931904

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "b7191cbb-6fbc-4f74-bb0f-9775c1447fe5",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "692e0681-5196-4bc3-9a9b-23ba40678f12",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226109045207732224

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "da34457c-b6f8-47d2-9cc8-0dfbf07ba23e",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod2",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod1",
          "uid": "ceaf421f-24f2-4233-bcbb-8cb2260c9ab9",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "2a81cb42-acbb-4678-ac5c-bd8f5540d3ff",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  }
]

liggitt commented Feb 9, 2020

No consistent pattern between failures; GC just seems to have been running slower, and the tests sometimes timed out as a result. Tracked the likely performance degradation to #86015; fix is in #87957.

liggitt commented Feb 9, 2020

note that #87957 should return performance to pre-1/24 levels, but this test was flaky even before that - https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-01-23&ci=0&pr=1&test=Garbage.*circle

liggitt commented Feb 10, 2020

Further debugging of GC lag issues in kube-controller-manager in e2e (after #87957 merged)

Combined log of e2e+kcm+apiserver for the timeout test failure from https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226918786276265984 is at https://gist.github.com/liggitt/8b9899559688c37c3b20623d4af48cb8

The RC is deleted by the e2e test:

17:37:27.945819 api 1 httplog.go:90] verb="DELETE" URI="/api/v1/namespaces/gc-4201/replicationcontrollers/simpletest.rc" latency=338.224039ms resp=200 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Conformance]" srcIP="35.225.169.82:33898": 

GC starts running attemptToDelete on the rc and all pods within a second:

17:37:28.238460 kcm 1 graph_builder.go:510] add [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] to the attemptToDelete, because it's waiting for its dependents to be deleted
17:37:28.238505 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f]
17:37:28.242244 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b]
17:37:28.242289 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac]
17:37:28.242309 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65]
17:37:28.242328 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234]
17:37:28.242353 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260]
17:37:28.242369 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce]
17:37:28.242387 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145]
17:37:28.242409 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e]
17:37:28.242426 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f]
17:37:28.242442 kcm 1 garbagecollector.go:404] processing item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]

attemptToDeleteItem on the RC (started at 17:37:28.242442) takes ~17 seconds to process (end result is calling processDeletingDependentsItem on all the child pods):

17:37:45.250386 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254778 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254828 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254838 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254849 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254862 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254871 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254882 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254893 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254903 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
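
For orientation, here is a heavily simplified sketch of the foreground-deletion bookkeeping described above; the types and helper names are illustrative stand-ins, not the actual garbagecollector.go code:

package main

import "fmt"

// Stand-in types for the GC dependency graph (not the real garbage collector types).
type objectRef struct{ Kind, Namespace, Name string }

type node struct {
	identity   objectRef
	dependents []*dependent
}

type dependent struct {
	node               *node
	blockOwnerDeletion bool // mirrors ownerReference.blockOwnerDeletion
}

// processDeletingDependentsDemo: an owner carrying the foregroundDeletion
// finalizer queues each still-blocking dependent for deletion ("adding ...
// because its owner ... is deletingDependents" in the logs); only when no
// blocking dependents remain can the DeleteDependents finalizer be removed.
func processDeletingDependentsDemo(owner *node, attemptToDelete chan<- *node) {
	blocking := 0
	for _, dep := range owner.dependents {
		if dep.blockOwnerDeletion {
			blocking++
			attemptToDelete <- dep.node
		}
	}
	if blocking == 0 {
		fmt.Printf("remove DeleteDependents finalizer for %+v\n", owner.identity)
	}
}

func main() {
	queue := make(chan *node, 10)
	pod := &node{identity: objectRef{"Pod", "gc-4201", "simpletest.rc-qtngh"}}
	rc := &node{
		identity:   objectRef{"ReplicationController", "gc-4201", "simpletest.rc"},
		dependents: []*dependent{{node: pod, blockOwnerDeletion: true}},
	}
	processDeletingDependentsDemo(rc, queue)
	fmt.Println("dependents queued for attemptToDelete:", len(queue))
}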

attemptToDelete on the pods (started at 17:37:28.238505 - 17:37:28.242426) also takes ~17 seconds to complete (end result is GC making a DELETE API call on each pod):

17:37:45.922114 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac] with propagation policy Background
17:37:45.927640 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145] with propagation policy Background
17:37:45.928498 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234] with propagation policy Background
17:37:45.928831 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b] with propagation policy Background
17:37:45.929030 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce] with propagation policy Background
17:37:45.929567 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260] with propagation policy Background
17:37:45.931980 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65] with propagation policy Background
17:37:45.932188 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f] with propagation policy Background
17:37:45.932399 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e] with propagation policy Background
17:37:45.932584 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f] with propagation policy Background

Two seconds of the ~17-second delay are spent by the GC controller rebuilding its restmapper via discovery:

17:37:43.255017 API 1 httplog.go:90] verb="GET" URI="/api?timeout=32s" latency=21.112843ms resp=200 UserAgent="kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/9ef89b4/system:serviceaccount:kube-system:generic-garbage-collector" srcIP="[::1]:44480": 
...
17:37:45.045271 API 1 httplog.go:90] verb="GET" URI="/apis/crd-publish-openapi-test-multi-ver.example.com/v2?timeout=32s" latency=205.259µs resp=200 UserAgent="kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/9ef89b4/system:serviceaccount:kube-system:generic-garbage-collector" srcIP="[::1]:44480": 

The other 15 seconds are unaccounted for.

Then, GC reacts to changing API group availability by performing its periodic resync of informers, which fails once (because another parallel e2e test removed a CRD at the same time), retries, takes a total of 39 seconds, and blocks item processing until complete:

17:37:47.301103 kcm 1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-multi-ver.example.com/v4, Resource=e2e-test-crd-publish-openapi-8940-crds mygroup.example.com/v1beta1, Resource=foovgz9bas], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8940-crds crd-publish-openapi-test-waldo.example.com/v1beta1, Resource=e2e-test-crd-publish-openapi-7903-crds]
17:37:47.301150 kcm 1 garbagecollector.go:206] reset restmapper
...
17:38:01.575773 kcm 1 garbagecollector.go:218] failed to sync resource monitors (attempt 1): couldn't look up resource "mygroup.example.com/v1beta1, Resource=foovgz9bas": no matches for mygroup.example.com/v1beta1, Resource=foovgz9bas
...
17:38:05.968107 kcm 1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 2): added: [kubectl.example.com/v1, Resource=e2e-test-kubectl-4050-crds], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8940-crds crd-publish-openapi-test-waldo.example.com/v1beta1, Resource=e2e-test-crd-publish-openapi-7903-crds]
17:38:05.968123 kcm 1 garbagecollector.go:206] reset restmapper
...
17:38:22.348467 kcm 1 garbagecollector.go:221] resynced monitors
...
17:38:26.784492 kcm 1 garbagecollector.go:241] synced garbage collector

After resyncing monitors, it immediately adds all the items that received updates or were deleted while it was resyncing, including the 10 deleted pods (which no longer exist) and their parent RC:

17:38:26.789723 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e]
17:38:26.789799 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: csi-mock-volumes-7412, name: csi-mockplugin-0, uid: f0a9aa9d-a9f6-4983-8d9e-3c440bee3cb7]
17:38:26.789813 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce]
17:38:26.789834 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpathplugin-c9chn, uid: 25a61b65-b8b3-4255-8ebb-81a59dbde81d]
17:38:26.789846 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145]
17:38:26.789863 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: csi-mock-volumes-7412, name: csi-mockplugin-resizer-6795b775bd, uid: 035b0da2-fe6a-4d47-9481-cac8ea16e2bf]
17:38:26.789875 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b]
17:38:26.789893 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: csi-mock-volumes-7412, name: csi-mockplugin-resizer-0, uid: 0a7f0992-19ee-46da-9928-6ca9d74bf6c3]
17:38:26.789903 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260]
17:38:26.789920 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpathplugin-0, uid: 39be01b1-2caf-4301-8b6e-c6acb232719f]
17:38:26.803712 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac]
17:38:26.803774 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpathplugin-868789c464, uid: a9e721d1-b620-4ecb-ac0e-815b4823621b]
17:38:26.803794 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65]
17:38:26.803825 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-provisioner-pd8pp, uid: 676703a3-c2b6-46ad-b99f-4f6db5456fdc]
17:38:26.803842 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f]
17:38:26.803872 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpath-provisioner-bcbc9d97c, uid: f0243906-1f4e-4806-bc66-90c88b82ca07]
17:38:26.803885 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f]
17:38:26.803905 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpath-provisioner-0, uid: 187b233b-f658-45e0-b65c-24f593baedb3]
17:38:26.803916 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234]
17:38:26.803936 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-resizer-g7nnk, uid: 987cf1df-4c11-473d-80ba-c2c8ac685e8c]
17:38:26.803946 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: statefulset-916, name: ss2-65c7964b94, uid: e83430bd-5a5b-4484-b22a-824fd7d34640]
17:38:26.803975 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: statefulset-916, name: ss2-84f9d6bf57, uid: 0365ca7a-2afc-4d87-817f-2778fddb39e4]
17:38:26.803992 kcm 1 garbagecollector.go:404] processing item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]
17:38:26.804009 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: services-3079, name: externalname-service-q2mjx, uid: d75b9ded-269e-40ad-967e-7e8cfafe105c]
17:38:26.804026 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: webhook-5381, name: e2e-test-webhook-8wbpj, uid: ba922f46-f6d1-4a21-a580-8b980b1ba1be]
17:38:26.804043 kcm 1 garbagecollector.go:404] processing item [apps/v1/ReplicaSet, namespace: webhook-5381, name: sample-webhook-deployment-5f65f8c764, uid: 9bc706aa-5f5b-4396-bcb2-73a741b40eb0]
17:38:26.804060 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-attacher-tn9d4, uid: 4e22b660-2fcd-4d32-bd1c-28457119f4bb]
17:38:26.804096 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpath-attacher-57966f57c7, uid: a0b9eb33-0271-453b-b071-205e40d0678f]
17:38:26.804108 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpath-attacher-0, uid: f15910f1-04f9-40c7-a38d-e57691a3f8fc]
17:38:26.804124 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: csi-mock-volumes-7412, name: csi-mockplugin-cc57b5697, uid: c540c991-b3d6-4959-afa5-530849ab21b1]

This time there is a ~16-second delay between attemptToDelete starting to handle the RC (at 17:38:26.803992) and the foreground deletion finalizer being removed:

17:38:42.991192 kcm 1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]

But the e2e test has already failed:

17:38:29.657: e2e: FAIL: failed to delete the rc: timed out waiting for the condition

jpbetz commented Feb 10, 2020

cc @caesarxuchao

liggitt commented Feb 10, 2020

The informer resync is not unexpected, because we have lots of parallel e2e tests that add/remove CRDs. The extra time for GC to recover from an API group sync failure (39 seconds, in this case) is not included in most of the GC e2e timeout periods.

The 17 and 15 second execution times for attemptToDelete are worthy of more investigation.

Combined, they made GC easily exceed the 60-second period the "should keep the rc around until all its pods are deleted if the deleteOptions says so" test waits for the RC to be removed.
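
For reference, a minimal sketch of the kind of bounded polling wait the e2e test performs (the 5s interval, 60s timeout, kubeconfig path, and use of a plain clientset are assumptions for illustration, not the test's exact code):

package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is a placeholder for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Poll until the RC is gone; if GC takes longer than the timeout, the
	// failure surfaces as "timed out waiting for the condition".
	err = wait.Poll(5*time.Second, 60*time.Second, func() (bool, error) {
		_, getErr := client.CoreV1().ReplicationControllers("gc-4201").
			Get(context.TODO(), "simpletest.rc", metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return true, nil // RC fully deleted, condition met
		}
		// Still present (or transient error): keep polling until timeout.
		return false, nil
	})
	if err != nil {
		fmt.Println("failed to delete the rc:", err)
	}
}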

liggitt added the priority/important-soon label and removed the priority/critical-urgent label Feb 12, 2020
liggitt commented Feb 13, 2020

adding fine-grained trace.LogIfLong statements to attemptToDelete in particular might be helpful in revealing exactly where we are spending so much time
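
A minimal sketch of that kind of instrumentation using k8s.io/utils/trace; the step names, the 100ms threshold, and the stubbed work are illustrative only, not the actual garbage collector change:

package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// attemptToDeleteDemo shows the tracing pattern only; the real
// attemptToDeleteItem lives in the garbage collector controller.
func attemptToDeleteDemo(item string) {
	trace := utiltrace.New("attemptToDelete", utiltrace.Field{Key: "item", Value: item})
	// The whole trace (with per-step timings) is logged only if total time
	// exceeds the threshold, so normal fast paths stay quiet.
	defer trace.LogIfLong(100 * time.Millisecond)

	time.Sleep(10 * time.Millisecond) // stub: fetch latest object from the API server
	trace.Step("got latest object")

	time.Sleep(10 * time.Millisecond) // stub: resolve REST mapping, classify owner references
	trace.Step("resolved REST mapping and classified owners")

	time.Sleep(10 * time.Millisecond) // stub: issue delete or finalizer patch
	trace.Step("issued delete")
}

func main() {
	attemptToDeleteDemo("v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh")
}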

liggitt assigned apelisse and caesarxuchao and unassigned liggitt Feb 13, 2020
liggitt changed the title from "[flaky test] Garbage collector should not be blocked by dependency circle [Conformance]" to "[flaky test] Garbage collector timeout tests" Feb 13, 2020
jpbetz commented Feb 20, 2020

> adding fine-grained trace.LogIfLong statements to attemptToDelete in particular might be helpful in revealing exactly where we are spending so much time

Tried this out with #88297. The long delays outside of sync.Monitors are almost entirely due to the latency of attemptToDeleteItem > getObject > gc.apiResource > gc.restMapper.RESTMapping > getDelegate > GetAPIGroupResources

I'm guessing that getDelegate is fetching from discovery in this codepath for the same reason GarbageCollector.Sync does: another parallel e2e test removed a CRD at the same time.

more details: #88297 (comment)

jpbetz commented Feb 20, 2020

Traced this down to the discovery memCacheClient's cache refresh. See #88382 for a fix.
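
For context, a minimal sketch of the cached-discovery plus deferred RESTMapper arrangement involved here (construction and kubeconfig path are illustrative; see #88382 for the actual fix):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is a placeholder for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}

	// The memory-cached discovery client serves RESTMapper lookups from its
	// in-memory cache; once invalidated (e.g. after CRDs are added/removed),
	// the next lookup has to refetch discovery, which is the slow path
	// observed in this issue.
	cachedDiscovery := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedDiscovery)

	mapping, err := mapper.RESTMapping(schema.GroupKind{Group: "", Kind: "Pod"}, "v1")
	if err != nil {
		panic(err)
	}
	fmt.Println(mapping.Resource) // "/v1, Resource=pods"

	// Reset invalidates the cached discovery data; the next RESTMapping call
	// pays the discovery refresh cost again.
	mapper.Reset()
}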
