
[flaky test] Garbage collector timeout tests #87668

Closed
liggitt opened this issue Jan 29, 2020 · 13 comments · Fixed by #88382
Labels: kind/flake · priority/important-soon · sig/api-machinery

Comments

liggitt commented Jan 29, 2020

Which jobs are flaking:
pull-kubernetes-e2e-gce

Which test(s) are flaking:

  • [sig-api-machinery] Garbage collector should not be blocked by dependency circle [Conformance]
  • [sig-api-machinery] Garbage collector should orphan RS created by deployment when deleteOptions.PropagationPolicy is Orphan [Conformance]
  • [sig-api-machinery] Garbage collector should support cascading deletion of custom resources
  • [sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Conformance]
  • [sig-api-machinery] Garbage collector should delete RS created by deployment when not orphaning [Conformance]
  • [sig-api-machinery] Garbage collector should delete pods created by rc when not orphaning [Conformance]
  • [sig-api-machinery] Garbage collector should delete jobs and pods created by cronjob

Testgrid link:
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=Garbage&width=10

Reason for failure:
timeout waiting for deletion to complete

Anything else we need to know:
seems to have gotten much worse on/after 1/24/2020:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24&test=Garbage

See the time period on 2/11 for lots of examples

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050

number 2 flake according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json

/sig api-machinery

liggitt added the kind/flake label Jan 29, 2020
k8s-ci-robot added the sig/api-machinery label Jan 29, 2020
liggitt changed the title from "Garbage collector should not be blocked by dependency circle" to "[flaky test] Garbage collector should not be blocked by dependency circle" Jan 29, 2020
liggitt commented Jan 29, 2020

from the failure in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050:

gc events related to the test:

I0124 12:31:59.328882       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:31:59.333422       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:31:59.333442       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:32:12.256961       1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] to attemptToDelete, because its owner [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848] is deletingDependents
I0124 12:32:12.257019       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:32:26.585144       1 garbagecollector.go:459] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"pod3", UID:"d8e5f600-770e-4476-b421-217a128dc2df", Controller:(*bool)(0xc006325f3a), BlockOwnerDeletion:(*bool)(0xc006325f3b)}, Namespace:"gc-7066"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"pod2", UID:"e6d3b93e-2d0c-44f3-9606-4d45e0563ca7", Controller:(*bool)(0xc00335522a), BlockOwnerDeletion:(*bool)(0xc00335522b)}}, will not garbage collect
I0124 12:32:26.896543       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:32:28.152245       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:32:28.869366       1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8973-crds], removed: []
I0124 12:32:43.246095       1 graph_builder.go:222] synced monitors; added 1, kept 51, removed 0
I0124 12:32:43.246138       1 garbagecollector.go:221] resynced monitors
I0124 12:32:53.454914       1 garbagecollector.go:241] synced garbage collector
I0124 12:32:53.455201       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:32:53.455218       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:04.486694       1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] to attemptToDelete, because its owner [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7] is deletingDependents
I0124 12:33:05.020213       1 garbagecollector.go:482] processing object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df], some of its owners and its dependent [[v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground
I0124 12:33:05.338836       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:33:05.834126       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:06.637367       1 garbagecollector.go:482] processing object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df], some of its owners and its dependent [[v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground
I0124 12:33:07.352110       1 garbagecollector.go:493] at least one owner of object [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground
I0124 12:33:12.389962       1 graph_builder.go:510] add [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df] to the attemptToDelete, because it's waiting for its dependents to be deleted
I0124 12:33:12.389983       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:12.390499       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:33:12.394994       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:12.600109       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod2, uid: e6d3b93e-2d0c-44f3-9606-4d45e0563ca7]
I0124 12:33:13.182483       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:13.196814       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]
I0124 12:33:13.571769       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod1, uid: f1ac1e94-d925-43ce-810b-8cf4314f8848]

test timed out:

12:33:28.425: failed to wait for all pods to be deleted: timed out waiting for the condition

gc events related to the test, after the timeout:

I0124 12:33:28.651239       1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-common-group.example.com/v4, Resource=e2e-test-crd-publish-openapi-1403-crds crd-publish-openapi-test-common-group.example.com/v5, Resource=e2e-test-crd-publish-openapi-3515-crds], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8973-crds]
I0124 12:33:40.958391       1 graph_builder.go:222] synced monitors; added 2, kept 51, removed 1
I0124 12:33:40.958472       1 garbagecollector.go:221] resynced monitors
I0124 12:33:43.518300       1 garbagecollector.go:241] synced garbage collector
I0124 12:33:43.518311       1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]
I0124 12:33:44.311382       1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/Pod, namespace: gc-7066, name: pod3, uid: d8e5f600-770e-4476-b421-217a128dc2df]

liggitt commented Jan 29, 2020

cc @caesarxuchao

fedebongio commented

/assign @liggitt
thank you for fixing it!

liggitt added the priority/critical-urgent label Feb 8, 2020
liggitt changed the title from "[flaky test] Garbage collector should not be blocked by dependency circle" to "[flaky test] Garbage collector should not be blocked by dependency circle [Conformance]" Feb 8, 2020
liggitt commented Feb 8, 2020

looking at the debug logs from the last several failures, filtering remaining pods to name+finalizers+ownerReferences with:

jq '[.[] | {"metadata":{"name":.metadata.name,"finalizers":(.metadata.finalizers // []), "ownerReferences": .metadata.ownerReferences}}]'

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226216133338599424:

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "594ead6c-1e42-45aa-a5b8-6f3306cd464f",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "4d2a5407-e124-4056-8ae5-514abdf13115",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226199409381871617:

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "b6ffe0f4-6b9c-4ed7-a0c8-b9f6e96cd5de",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226179642994462720

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "26f20e5b-8d91-45a8-90f3-3de326cbff50",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod2",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod1",
          "uid": "af8bf19e-0860-4749-8aca-f3ee6f0d8702",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "73d4a2e1-16a4-42fc-a2c8-2e43a5fa5f47",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226179511796633600

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "dfa0ad4c-7a68-48dc-9cc8-60a3e85e9f14",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226155482490605569

[
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "42b0f0c3-0e9e-460a-9300-007b58a69386",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226128057127931904

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "b7191cbb-6fbc-4f74-bb0f-9775c1447fe5",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "692e0681-5196-4bc3-9a9b-23ba40678f12",
          "controller": true,
          "blockOwnerDeletion": false
        }
      ]
    }
  }
]

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226109045207732224

[
  {
    "metadata": {
      "name": "pod1",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod3",
          "uid": "da34457c-b6f8-47d2-9cc8-0dfbf07ba23e",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod2",
      "finalizers": [
        "foregroundDeletion"
      ],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod1",
          "uid": "ceaf421f-24f2-4233-bcbb-8cb2260c9ab9",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  },
  {
    "metadata": {
      "name": "pod3",
      "finalizers": [],
      "ownerReferences": [
        {
          "apiVersion": "v1",
          "kind": "Pod",
          "name": "pod2",
          "uid": "2a81cb42-acbb-4678-ac5c-bd8f5540d3ff",
          "controller": true,
          "blockOwnerDeletion": true
        }
      ]
    }
  }
]

liggitt commented Feb 9, 2020

No consistent pattern between failures; GC just seems to have been running slower, and the tests sometimes timed out as a result. Tracked the likely performance degradation to #86015; fix is in #87957.

liggitt commented Feb 9, 2020

note that #87957 should return performance to pre-1/24 levels, but this test was flaky even before that - https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-01-23&ci=0&pr=1&test=Garbage.*circle

liggitt commented Feb 10, 2020

Further debugging of GC lag issues in kube-controller-manager in e2e (after #87957 merged)

Combined log of e2e+kcm+apiserver for the timeout test failure from https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226918786276265984 is at https://gist.github.com/liggitt/8b9899559688c37c3b20623d4af48cb8

The RC is deleted by the e2e test:

17:37:27.945819 api 1 httplog.go:90] verb="DELETE" URI="/api/v1/namespaces/gc-4201/replicationcontrollers/simpletest.rc" latency=338.224039ms resp=200 UserAgent="e2e.test/v0.0.0 (linux/amd64) kubernetes/$Format -- [sig-api-machinery] Garbage collector should keep the rc around until all its pods are deleted if the deleteOptions says so [Conformance]" srcIP="35.225.169.82:33898": 

GC starts running attemptToDelete on the rc and all pods within a second:

17:37:28.238460 kcm 1 graph_builder.go:510] add [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] to the attemptToDelete, because it's waiting for its dependents to be deleted
17:37:28.238505 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f]
17:37:28.242244 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b]
17:37:28.242289 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac]
17:37:28.242309 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65]
17:37:28.242328 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234]
17:37:28.242353 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260]
17:37:28.242369 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce]
17:37:28.242387 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145]
17:37:28.242409 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e]
17:37:28.242426 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f]
17:37:28.242442 kcm 1 garbagecollector.go:404] processing item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]

attemptToDeleteItem on the RC (started at 17:37:28.242442) takes ~17 seconds to process (end result is calling processDeletingDependentsItem on all the child pods):

17:37:45.250386 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254778 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254828 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254838 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254849 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254862 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254871 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254882 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254893 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
17:37:45.254903 kcm 1 garbagecollector.go:531] adding [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f] to attemptToDelete, because its owner [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e] is deletingDependents
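
For orientation, here is a heavily simplified sketch of the foreground-deletion bookkeeping described above; the types and helper names are illustrative stand-ins, not the actual garbagecollector.go code:

package main

import "fmt"

// Stand-in types for the GC dependency graph (not the real garbage collector types).
type objectRef struct{ Kind, Namespace, Name string }

type node struct {
	identity   objectRef
	dependents []*dependent
}

type dependent struct {
	node               *node
	blockOwnerDeletion bool // mirrors ownerReference.blockOwnerDeletion
}

// processDeletingDependentsDemo: an owner carrying the foregroundDeletion
// finalizer queues each still-blocking dependent for deletion ("adding ...
// because its owner ... is deletingDependents" in the logs); only when no
// blocking dependents remain can the DeleteDependents finalizer be removed.
func processDeletingDependentsDemo(owner *node, attemptToDelete chan<- *node) {
	blocking := 0
	for _, dep := range owner.dependents {
		if dep.blockOwnerDeletion {
			blocking++
			attemptToDelete <- dep.node
		}
	}
	if blocking == 0 {
		fmt.Printf("remove DeleteDependents finalizer for %+v\n", owner.identity)
	}
}

func main() {
	queue := make(chan *node, 10)
	pod := &node{identity: objectRef{"Pod", "gc-4201", "simpletest.rc-qtngh"}}
	rc := &node{
		identity:   objectRef{"ReplicationController", "gc-4201", "simpletest.rc"},
		dependents: []*dependent{{node: pod, blockOwnerDeletion: true}},
	}
	processDeletingDependentsDemo(rc, queue)
	fmt.Println("dependents queued for attemptToDelete:", len(queue))
}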

attemptToDelete on the pods (started at 17:37:28.238505 - 17:37:28.242426) also takes ~17 seconds to complete (end result is GC making a DELETE API call on each pod):

17:37:45.922114 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac] with propagation policy Background
17:37:45.927640 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145] with propagation policy Background
17:37:45.928498 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234] with propagation policy Background
17:37:45.928831 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b] with propagation policy Background
17:37:45.929030 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce] with propagation policy Background
17:37:45.929567 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260] with propagation policy Background
17:37:45.931980 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65] with propagation policy Background
17:37:45.932188 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f] with propagation policy Background
17:37:45.932399 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e] with propagation policy Background
17:37:45.932584 kcm 1 garbagecollector.go:517] delete object [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f] with propagation policy Background

Two seconds of the ~17-second delay are spent by the GC controller rebuilding its restmapper via discovery:

17:37:43.255017 API 1 httplog.go:90] verb="GET" URI="/api?timeout=32s" latency=21.112843ms resp=200 UserAgent="kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/9ef89b4/system:serviceaccount:kube-system:generic-garbage-collector" srcIP="[::1]:44480": 
...
17:37:45.045271 API 1 httplog.go:90] verb="GET" URI="/apis/crd-publish-openapi-test-multi-ver.example.com/v2?timeout=32s" latency=205.259µs resp=200 UserAgent="kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/9ef89b4/system:serviceaccount:kube-system:generic-garbage-collector" srcIP="[::1]:44480": 

The other 15 seconds are unaccounted for.

Then, GC reacts to changing API group availability by performing its periodic resync of informers, which fails once (because another parallel e2e test removed a CRD at the same time), retries, takes a total of 39 seconds, and blocks item processing until complete:

17:37:47.301103 kcm 1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 1): added: [crd-publish-openapi-test-multi-ver.example.com/v4, Resource=e2e-test-crd-publish-openapi-8940-crds mygroup.example.com/v1beta1, Resource=foovgz9bas], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8940-crds crd-publish-openapi-test-waldo.example.com/v1beta1, Resource=e2e-test-crd-publish-openapi-7903-crds]
17:37:47.301150 kcm 1 garbagecollector.go:206] reset restmapper
...
17:38:01.575773 kcm 1 garbagecollector.go:218] failed to sync resource monitors (attempt 1): couldn't look up resource "mygroup.example.com/v1beta1, Resource=foovgz9bas": no matches for mygroup.example.com/v1beta1, Resource=foovgz9bas
...
17:38:05.968107 kcm 1 garbagecollector.go:200] syncing garbage collector with updated resources from discovery (attempt 2): added: [kubectl.example.com/v1, Resource=e2e-test-kubectl-4050-crds], removed: [crd-publish-openapi-test-multi-ver.example.com/v3, Resource=e2e-test-crd-publish-openapi-8940-crds crd-publish-openapi-test-waldo.example.com/v1beta1, Resource=e2e-test-crd-publish-openapi-7903-crds]
17:38:05.968123 kcm 1 garbagecollector.go:206] reset restmapper
...
17:38:22.348467 kcm 1 garbagecollector.go:221] resynced monitors
...
17:38:26.784492 kcm 1 garbagecollector.go:241] synced garbage collector

After resyncing monitors, it immediately adds all the items that received updates or were deleted while it was resyncing, including the 10 deleted pods (which no longer exist) and their parent RC:

17:38:26.789723 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-8zw6m, uid: bd75d687-0a9a-4822-92de-9185a2ba5f2e]
17:38:26.789799 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: csi-mock-volumes-7412, name: csi-mockplugin-0, uid: f0a9aa9d-a9f6-4983-8d9e-3c440bee3cb7]
17:38:26.789813 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-dvqpg, uid: c7d9308d-66c6-4ca6-b006-bd611cd76dce]
17:38:26.789834 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpathplugin-c9chn, uid: 25a61b65-b8b3-4255-8ebb-81a59dbde81d]
17:38:26.789846 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-9s447, uid: 24e4a3fb-bdd0-45a8-9d5a-c568fb3b9145]
17:38:26.789863 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: csi-mock-volumes-7412, name: csi-mockplugin-resizer-6795b775bd, uid: 035b0da2-fe6a-4d47-9481-cac8ea16e2bf]
17:38:26.789875 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-27q4d, uid: 0202083c-033f-418d-9201-f5fb6a5e090b]
17:38:26.789893 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: csi-mock-volumes-7412, name: csi-mockplugin-resizer-0, uid: 0a7f0992-19ee-46da-9928-6ca9d74bf6c3]
17:38:26.789903 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x62h4, uid: 5727bc26-90b6-4641-bc0c-8c429bc30260]
17:38:26.789920 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpathplugin-0, uid: 39be01b1-2caf-4301-8b6e-c6acb232719f]
17:38:26.803712 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-kbm92, uid: cda63cb5-ecdb-4c27-be10-12764bb7adac]
17:38:26.803774 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpathplugin-868789c464, uid: a9e721d1-b620-4ecb-ac0e-815b4823621b]
17:38:26.803794 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-x8wgx, uid: 005d96f0-a5b1-48ac-97ca-f701c81dcc65]
17:38:26.803825 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-provisioner-pd8pp, uid: 676703a3-c2b6-46ad-b99f-4f6db5456fdc]
17:38:26.803842 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-5gw5k, uid: c6636382-a4c0-4ca5-a290-9ae7a9fe635f]
17:38:26.803872 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpath-provisioner-bcbc9d97c, uid: f0243906-1f4e-4806-bc66-90c88b82ca07]
17:38:26.803885 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh, uid: 5d6a18a2-e9f1-420d-883c-e1702df8d67f]
17:38:26.803905 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpath-provisioner-0, uid: 187b233b-f658-45e0-b65c-24f593baedb3]
17:38:26.803916 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: gc-4201, name: simpletest.rc-zxmp4, uid: 19ffc260-f849-4aa3-913a-a740361eb234]
17:38:26.803936 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-resizer-g7nnk, uid: 987cf1df-4c11-473d-80ba-c2c8ac685e8c]
17:38:26.803946 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: statefulset-916, name: ss2-65c7964b94, uid: e83430bd-5a5b-4484-b22a-824fd7d34640]
17:38:26.803975 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: statefulset-916, name: ss2-84f9d6bf57, uid: 0365ca7a-2afc-4d87-817f-2778fddb39e4]
17:38:26.803992 kcm 1 garbagecollector.go:404] processing item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]
17:38:26.804009 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: services-3079, name: externalname-service-q2mjx, uid: d75b9ded-269e-40ad-967e-7e8cfafe105c]
17:38:26.804026 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: webhook-5381, name: e2e-test-webhook-8wbpj, uid: ba922f46-f6d1-4a21-a580-8b980b1ba1be]
17:38:26.804043 kcm 1 garbagecollector.go:404] processing item [apps/v1/ReplicaSet, namespace: webhook-5381, name: sample-webhook-deployment-5f65f8c764, uid: 9bc706aa-5f5b-4396-bcb2-73a741b40eb0]
17:38:26.804060 kcm 1 garbagecollector.go:404] processing item [discovery.k8s.io/v1beta1/EndpointSlice, namespace: provisioning-5112, name: csi-hostpath-attacher-tn9d4, uid: 4e22b660-2fcd-4d32-bd1c-28457119f4bb]
17:38:26.804096 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: provisioning-5112, name: csi-hostpath-attacher-57966f57c7, uid: a0b9eb33-0271-453b-b071-205e40d0678f]
17:38:26.804108 kcm 1 garbagecollector.go:404] processing item [v1/Pod, namespace: provisioning-5112, name: csi-hostpath-attacher-0, uid: f15910f1-04f9-40c7-a38d-e57691a3f8fc]
17:38:26.804124 kcm 1 garbagecollector.go:404] processing item [apps/v1/ControllerRevision, namespace: csi-mock-volumes-7412, name: csi-mockplugin-cc57b5697, uid: c540c991-b3d6-4959-afa5-530849ab21b1]

This time there is a ~16-second delay between attemptToDelete starting to handle the RC (at 17:38:26.803992) and the foreground deletion finalizer being removed:

17:38:42.991192 kcm 1 garbagecollector.go:526] remove DeleteDependents finalizer for item [v1/ReplicationController, namespace: gc-4201, name: simpletest.rc, uid: e1f2ae7e-5073-431c-bb22-fb020cc4e67e]

But the e2e test has already failed:

17:38:29.657: e2e: FAIL: failed to delete the rc: timed out waiting for the condition

jpbetz commented Feb 10, 2020

cc @caesarxuchao

liggitt commented Feb 10, 2020

The informer resync is not unexpected, because we have lots of parallel e2e tests that add/remove CRDs. The extra time for GC to recover from an API group sync failure (39 seconds, in this case) is not included in most of the GC e2e timeout periods.

The 17 and 15 second execution times for attemptToDelete are worthy of more investigation.

Combined, they made GC easily exceed the 60-second period the "should keep the rc around until all its pods are deleted if the deleteOptions says so" test waits for the RC to be removed.
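
For reference, a minimal sketch of the kind of bounded polling wait the e2e test performs (the 5s interval, 60s timeout, kubeconfig path, and use of a plain clientset are assumptions for illustration, not the test's exact code):

package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is a placeholder for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Poll until the RC is gone; if GC takes longer than the timeout, the
	// failure surfaces as "timed out waiting for the condition".
	err = wait.Poll(5*time.Second, 60*time.Second, func() (bool, error) {
		_, getErr := client.CoreV1().ReplicationControllers("gc-4201").
			Get(context.TODO(), "simpletest.rc", metav1.GetOptions{})
		if apierrors.IsNotFound(getErr) {
			return true, nil // RC fully deleted, condition met
		}
		// Still present (or transient error): keep polling until timeout.
		return false, nil
	})
	if err != nil {
		fmt.Println("failed to delete the rc:", err)
	}
}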

liggitt added the priority/important-soon label and removed the priority/critical-urgent label Feb 12, 2020
liggitt commented Feb 13, 2020

adding fine-grained trace.LogIfLong statements to attemptToDelete in particular might be helpful in revealing exactly where we are spending so much time
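
A minimal sketch of that kind of instrumentation using k8s.io/utils/trace; the step names, the 100ms threshold, and the stubbed work are illustrative only, not the actual garbage collector change:

package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// attemptToDeleteDemo shows the tracing pattern only; the real
// attemptToDeleteItem lives in the garbage collector controller.
func attemptToDeleteDemo(item string) {
	trace := utiltrace.New("attemptToDelete", utiltrace.Field{Key: "item", Value: item})
	// The whole trace (with per-step timings) is logged only if total time
	// exceeds the threshold, so normal fast paths stay quiet.
	defer trace.LogIfLong(100 * time.Millisecond)

	time.Sleep(10 * time.Millisecond) // stub: fetch latest object from the API server
	trace.Step("got latest object")

	time.Sleep(10 * time.Millisecond) // stub: resolve REST mapping, classify owner references
	trace.Step("resolved REST mapping and classified owners")

	time.Sleep(10 * time.Millisecond) // stub: issue delete or finalizer patch
	trace.Step("issued delete")
}

func main() {
	attemptToDeleteDemo("v1/Pod, namespace: gc-4201, name: simpletest.rc-qtngh")
}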

liggitt assigned apelisse and caesarxuchao and unassigned liggitt Feb 13, 2020
liggitt changed the title from "[flaky test] Garbage collector should not be blocked by dependency circle [Conformance]" to "[flaky test] Garbage collector timeout tests" Feb 13, 2020
jpbetz commented Feb 20, 2020

> adding fine-grained trace.LogIfLong statements to attemptToDelete in particular might be helpful in revealing exactly where we are spending so much time

Tried this out with #88297. The long delays outside of sync.Monitors are almost entirely due to the latency of attemptToDeleteItem > getObject > gc.apiResource > gc.restMapper.RESTMapping > getDelegate > GetAPIGroupResources

I'm guessing that getDelegate is fetching from discovery in this codepath for the same reason GarbageCollector.Sync does: another parallel e2e test removed a CRD at the same time.

more details: #88297 (comment)

jpbetz commented Feb 20, 2020

Traced this down to the discovery memCacheClient's cache refresh. See #88382 for a fix.
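
For context, a minimal sketch of the cached-discovery plus deferred RESTMapper arrangement involved here (construction and kubeconfig path are illustrative; see #88382 for the actual fix):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is a placeholder for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}

	// The memory-cached discovery client serves RESTMapper lookups from its
	// in-memory cache; once invalidated (e.g. after CRDs are added/removed),
	// the next lookup has to refetch discovery, which is the slow path
	// observed in this issue.
	cachedDiscovery := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedDiscovery)

	mapping, err := mapper.RESTMapping(schema.GroupKind{Group: "", Kind: "Pod"}, "v1")
	if err != nil {
		panic(err)
	}
	fmt.Println(mapping.Resource) // "/v1, Resource=pods"

	// Reset invalidates the cached discovery data; the next RESTMapping call
	// pays the discovery refresh cost again.
	mapper.Reset()
}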
