[flaky test] Garbage collector timeout tests #87668
from the failure in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050:

test timed out:

gc events related to the test:
/assign @liggitt
top flake according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json
looking at the debug logs from the last several failures, filtering remaining pods to name+finalizers+ownerReferences with:

jq '[.[] | {"metadata":{"name":.metadata.name,"finalizers":(.metadata.finalizers // []), "ownerReferences": .metadata.ownerReferences}}]'

[
{
"metadata": {
"name": "pod1",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod3",
"uid": "594ead6c-1e42-45aa-a5b8-6f3306cd464f",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "4d2a5407-e124-4056-8ae5-514abdf13115",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "b6ffe0f4-6b9c-4ed7-a0c8-b9f6e96cd5de",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod1",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod3",
"uid": "26f20e5b-8d91-45a8-90f3-3de326cbff50",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod2",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod1",
"uid": "af8bf19e-0860-4749-8aca-f3ee6f0d8702",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "73d4a2e1-16a4-42fc-a2c8-2e43a5fa5f47",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "dfa0ad4c-7a68-48dc-9cc8-60a3e85e9f14",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "42b0f0c3-0e9e-460a-9300-007b58a69386",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod1",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod3",
"uid": "b7191cbb-6fbc-4f74-bb0f-9775c1447fe5",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod3",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "692e0681-5196-4bc3-9a9b-23ba40678f12",
"controller": true,
"blockOwnerDeletion": false
}
]
}
}
]

[
{
"metadata": {
"name": "pod1",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod3",
"uid": "da34457c-b6f8-47d2-9cc8-0dfbf07ba23e",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod2",
"finalizers": [
"foregroundDeletion"
],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod1",
"uid": "ceaf421f-24f2-4233-bcbb-8cb2260c9ab9",
"controller": true,
"blockOwnerDeletion": true
}
]
}
},
{
"metadata": {
"name": "pod3",
"finalizers": [],
"ownerReferences": [
{
"apiVersion": "v1",
"kind": "Pod",
"name": "pod2",
"uid": "2a81cb42-acbb-4678-ac5c-bd8f5540d3ff",
"controller": true,
"blockOwnerDeletion": true
}
]
}
}
]
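For reference, the ownership cycle those dumps show (pod1 owned by pod3, pod3 owned by pod2, pod2 owned by pod1, all under foreground deletion) can be sketched with client-go types roughly as follows. This is an illustrative sketch, not the e2e test's actual setup code: names and UIDs are placeholders, and the foregroundDeletion finalizer in the dumps is added by the apiserver when foreground deletion is requested, not by anything shown here.

// Hedged sketch: building the kind of circular ownership seen in the dumps above.
// pod1 -> pod3 (blocking), pod2 -> pod1 (blocking), pod3 -> pod2 (non-blocking),
// so with foreground deletion each pod's removal waits on another pod in the cycle.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownerRef builds a controller owner reference to another pod.
func ownerRef(name string, uid types.UID, blockOwnerDeletion bool) metav1.OwnerReference {
	controller := true
	return metav1.OwnerReference{
		APIVersion:         "v1",
		Kind:               "Pod",
		Name:               name,
		UID:                uid,
		Controller:         &controller,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}
}

func main() {
	pod1 := corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod1", UID: "uid-1"}}
	pod2 := corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod2", UID: "uid-2"}}
	pod3 := corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod3", UID: "uid-3"}}

	// Wire the cycle the same way the dumps show it.
	pod1.OwnerReferences = []metav1.OwnerReference{ownerRef(pod3.Name, pod3.UID, true)}
	pod2.OwnerReferences = []metav1.OwnerReference{ownerRef(pod1.Name, pod1.UID, true)}
	pod3.OwnerReferences = []metav1.OwnerReference{ownerRef(pod2.Name, pod2.UID, false)}

	fmt.Printf("%s owned by %s\n", pod1.Name, pod1.OwnerReferences[0].Name)
	fmt.Printf("%s owned by %s\n", pod2.Name, pod2.OwnerReferences[0].Name)
	fmt.Printf("%s owned by %s\n", pod3.Name, pod3.OwnerReferences[0].Name)
}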
note that #87957 should return performance to pre-1/24 levels, but this test was flaky even before that - https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-01-23&ci=0&pr=1&test=Garbage.*circle
Further debugging of GC lag issues in kube-controller-manager in e2e (after #87957 merged)

Combined log of e2e+kcm+apiserver for the timeout test failure from https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1226918786276265984 is at https://gist.github.com/liggitt/8b9899559688c37c3b20623d4af48cb8

The RC is deleted by the e2e test:
GC starts running attemptToDelete on the rc and all pods within a second:
attemptToDeleteItem on the RC (started at 17:37:28.242442) takes ~17 seconds to process (end result is calling processDeletingDependentsItem on all the child pods):
attemptToDelete on the pods (started at 17:37:28.238505 - 17:37:28.242426) also takes ~17 seconds to complete (end result is GC making a DELETE API call on each pod):
2 seconds of the ~17 second delay are spent on the GC controller rebuilding its restmapper via discovery:
The other 15 seconds are unaccounted for. Then, GC reacts to changing API group availability by performing its periodic resync of informers, which fails once (because another parallel e2e test removed a CRD at the same time), retries, takes a total of 39 seconds, and blocks item processing until complete:
After resyncing monitors, it immediately adds all the items that received updates or were deleted while it was resyncing, including the 10 deleted pods (which no longer exist) and their parent RC:
This time there is a ~16 second delay between attemptToDelete starting to handle the RC (at 17:38:26.803992) and the foreground deletion finalizer being removed:
But the e2e test has already failed:
The informer resync is not unexpected, because we have lots of parallel e2e tests that add/remove CRDs. The extra time for GC to recover from an API group sync failure (39 seconds, in this case) is not included in most of the GC e2e timeout periods. The 17 and 15 second execution times for attemptToDelete are worthy of more investigation. Combined, they made GC easily exceed the 60 second period (actual time: …).
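To make the "resync blocks item processing" behavior concrete, here is a minimal sketch of the general pattern, assuming the GC pauses its workers with a shared lock while monitors are resynced (which matches the behavior described above). It is an illustration of the mechanism, not the actual kube-controller-manager code; the names and durations are placeholders.

// Minimal sketch: while monitors are being resynced (writer lock held), workers
// that would process attemptToDelete items are blocked on the reader lock.
package main

import (
	"fmt"
	"sync"
	"time"
)

type gc struct {
	workerLock sync.RWMutex
}

// resyncMonitors simulates a slow informer/monitor resync, e.g. retrying after
// a discovery failure caused by a parallel test deleting a CRD.
func (g *gc) resyncMonitors() {
	g.workerLock.Lock() // writer lock: no worker can run until resync finishes
	defer g.workerLock.Unlock()
	time.Sleep(2 * time.Second) // stand-in for the 39 seconds observed above
}

// attemptToDeleteWorker simulates one GC worker processing a queued item.
func (g *gc) attemptToDeleteWorker(item string) {
	g.workerLock.RLock() // blocks while resyncMonitors holds the writer lock
	defer g.workerLock.RUnlock()
	fmt.Println("processed", item)
}

func main() {
	g := &gc{}
	go g.resyncMonitors()
	time.Sleep(100 * time.Millisecond) // let the resync grab the lock first

	start := time.Now()
	g.attemptToDeleteWorker("pods/pod1")
	fmt.Println("worker was blocked for", time.Since(start).Round(time.Second))
}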
adding fine-grained trace.LogIfLong statements to attemptToDelete in particular might be helpful in revealing exactly where we are spending so much time
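For illustration, instrumenting a function with k8s.io/utils/trace looks roughly like this. The function, steps, sleeps, and threshold below are placeholders for the purpose of the sketch, not the actual change made in #88297.

// Sketch of per-step tracing with k8s.io/utils/trace; the full trace (with
// per-step timings) is only logged when it exceeds the threshold, so output
// stays quiet in the common case.
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

func attemptToDeleteItem(item string) {
	trace := utiltrace.New("attemptToDelete " + item)
	// Log the whole trace only if the call took longer than 1 second.
	defer trace.LogIfLong(1 * time.Second)

	time.Sleep(600 * time.Millisecond) // stand-in for: fetch latest object from the apiserver
	trace.Step("got latest object")

	time.Sleep(600 * time.Millisecond) // stand-in for: classify dependents / issue delete calls
	trace.Step("processed dependents")
}

func main() {
	attemptToDeleteItem("pods/pod1")
}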
Tried this out with #88297. The long delays outside of sync.Monitors are almost entirely due to the latency of … I'm guessing that … More details: #88297 (comment)
Traced this down to the discovery …
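For context on the discovery-driven part of GC startup/resync, rebuilding a RESTMapper from API discovery looks roughly like the sketch below, using client-go's restmapper helpers. The GC in kube-controller-manager uses a cached/deferred variant wired up by the controller manager, and the kubeconfig path here is a placeholder, so treat this only as an illustration of the kind of work involved.

// Sketch: rebuilding a RESTMapper from API discovery, the kind of work the GC
// repeats when API group availability changes.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path for the sketch.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Each rebuild makes a series of discovery requests; with many API groups
	// this can take a noticeable amount of time.
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		panic(err)
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	fmt.Printf("built RESTMapper covering %d API groups: %T\n", len(groupResources), mapper)
}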
Which jobs are flaking:
pull-kubernetes-e2e-gce
Which test(s) are flaking:
Testgrid link:
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&include-filter-by-regex=Garbage&width=10
Reason for failure:
timeout waiting for deletion to complete
Anything else we need to know:
seems to have gotten much worse on/after 1/24/2020:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24&test=Garbage
See the time period on 2/11 for lots of examples
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1220677312073175050
number 2 flake according to http://storage.googleapis.com/k8s-metrics/flakes-latest.json
/sig api-machinery