
[Failing Test] ci-kubernetes-e2e-gce-scale-{performance, correctness} fail due to hanging DumpClusterLogs #85753

Closed
oxddr opened this issue Nov 29, 2019 · 10 comments · Fixed by #85858

Comments

@oxddr (Contributor) commented Nov 29, 2019

Which jobs are failing:
gce-master-scale-performance
gce-master-scale-correctness

Which test(s) are failing:
n/a

Since when has it been failing:
11/28

Testgrid link:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-master-scale-performance
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-master-scale-correctness

Reason for failure:
Tests timed out due to a longer-than-usual DumpClusterLogs phase: 5jVxEfY1vCk

Anything else we need to know:

/priority important-soon
/milestone v1.18
/sig scalability
/assign

@k8s-ci-robot added this to the v1.18 milestone Nov 29, 2019
@oxddr changed the title from "[Failing Test] ci-kubernetes-e2e-gce-scale-{performance, correctness} fails due to hanging DumpClusterLogs" to "[Failing Test] ci-kubernetes-e2e-gce-scale-{performance, correctness} fail due to hanging DumpClusterLogs" Nov 29, 2019
@hasheddan (Contributor) commented Nov 29, 2019

@hasheddan added this to New (no response yet) in 1.17 CI Signal Nov 29, 2019
@oxddr (Contributor, Author) commented Dec 2, 2019

/priority important-critical
All runs of performance and correctness since 11/28 are affected.

/assign @jkaniuk
/unassign
Jacek, could you have a look into this one?

@k8s-ci-robot assigned jkaniuk and unassigned oxddr Dec 2, 2019
@hasheddan (Contributor) commented Dec 2, 2019

/milestone v1.17

@k8s-ci-robot modified the milestones: v1.18, v1.17 Dec 2, 2019
@justaugustus (Member) commented Dec 2, 2019

/remove-priority important-soon
/priority critical-urgent

@kubernetes/sig-scalability-bugs -- Please provide an update to @kubernetes/release-managers and @kubernetes/release-team-leads as soon as you can as this could impact the release timeline.

@jkaniuk (Contributor) commented Dec 3, 2019

The DumpClusterLogs phase is stuck doing kubectl logs for each logexporter pod.

Relevant loop:
https://github.com/kubernetes/kubernetes/blob/master/cluster/log-dump/log-dump.sh#L509

The problem started occurring between commits
7c87b5f Wed Nov 27 08:25:20 2019 -0800
and
dcb7daa Thu Nov 28 03:03:03 2019 -0800

There was one change to log-dump.sh merged during that time:
#83604
e99a325#diff-5d682968f2f3b4ff2f9f2bb9a303f650L507

which modified that loop from:

kubectl get po | while read -r pod node; do
  echo "Fetching logs from ${pod} running on ${node}"
  ...
done

to:

while read -r pod node; do
  echo "Fetching logs from ${pod} running on ${node}"
  ...
done < <(kubectl get po)

Both work fine on a 100-node cluster, but the latter seems to break on 5k-node clusters.
The first form runs the loop in a subshell created by the pipeline; the latter runs the loop in the main shell.

Running experiments to verify.
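
A minimal sketch (not from the issue) of that subshell difference: with a pipeline the loop body runs in a subshell, so its variable changes are lost, while with process substitution the loop runs in the main shell:

#!/bin/bash
count=0
printf 'a\nb\n' | while read -r _; do count=$((count + 1)); done
echo "pipeline: count=${count}"   # prints 0 with default bash settings - the loop ran in a subshell

count=0
while read -r _; do count=$((count + 1)); done < <(printf 'a\nb\n')
echo "procsub:  count=${count}"   # prints 2 - the loop ran in the main shell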

@hasheddan moved this from New (no response yet) to Under investigation (prioritized) in 1.17 CI Signal Dec 3, 2019
1.17 CI Signal automation moved this from Under investigation (prioritized) to Observing (observe test failure/flake before marking as resolved) Dec 3, 2019
@hasheddan (Contributor) commented Dec 3, 2019

Thank you @jkaniuk! Will monitor subsequent tests to make sure the fix is working as expected :)

@jkaniuk (Contributor) commented Dec 4, 2019

Looks like the revert has helped:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1202196266734325764

Previously only the first batch of log exports ran (then the timeout happened).
It looked like this:

I1129 05:18:19.619] Fetching logs from logexporter-225rh running on gce-scale-cluster-minion-group-2-kk2h
... (25 occurrences total)
I1129 05:18:19.696] Fetching logs from logexporter-27prq running on gce-scale-cluster-minion-group-3-51wz
W1129 10:33:20.444] 2019/11/29 10:33:20 process.go:199: Interrupt after 17h30m0s timeout during ./cluster/log-dump/log-dump.sh /workspace/_artifacts gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1200097516549312512/artifacts. Will terminate in another 15m

The 25 occurrences correspond to the level of parallelism in the bash script:
https://github.com/kubernetes/kubernetes/blob/master/cluster/log-dump/log-dump.sh#L60
readonly max_dump_processes=25
Only those 25 logs were downloaded and saved successfully to the Cloud Storage bucket.

This time, all passed:

I1204 14:32:39.764] Fetching logs from logexporter-2226h running on gce-scale-cluster-minion-group-1-0zs2
... (5000 occurrences total)
I1204 14:35:05.272] Fetching logs from logexporter-zzthm running on gce-scale-cluster-minion-group-2-66kk
I1204 14:35:07.962] Successfully listed marker files for successful nodes

@jkaniuk (Contributor) commented Dec 4, 2019

Something that could explain the difference between the 100-node and 5000-node runs:

The output of a piped command is buffered with a 64 KB buffer, so when the output is larger, the command keeps running. In the case of
kubectl get pods -n logexporter -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
the threshold should be around 1200 nodes.
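
A rough sanity check of that threshold (an estimate, assuming each "pod<TAB>node" line is about 55-57 bytes, e.g. "logexporter-225rh" plus "gce-scale-cluster-minion-group-2-kk2h"):

# 64 KiB pipe buffer divided by ~55-57 bytes per pod/node line
echo $(( 64 * 1024 / 57 ))   # 1149
echo $(( 64 * 1024 / 55 ))   # 1191 -> roughly 1200 nodes fill the buffer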

Unfortunately I did not manage to reproduce the problem directly in bash.

@jkaniuk (Contributor) commented Dec 4, 2019

I have managed to reproduce it in a simple bash script:

#!/bin/bash
output_kb=70 # works fine for 64 and below
max_dump_processes=2
# prints ~1 KiB of text (102 * 10 bytes + "xyz\n")
function print_1k() {
  for _ in {1..102}; do
    echo -n '123456789 '
  done
  echo xyz
}
# generates $1 KiB of input for the loop
function generator() {
  for _ in $(seq 1 "$1"); do
    print_1k
  done
}
proc=${max_dump_processes}
# alternative pipeline form (loop runs in a subshell) - does not hang:
#generator $output_kb | cat | \
while read -r _; do
  echo "Starting" && sleep 1 && echo "done" &
  proc=$((proc - 1))
  if [[ proc -eq 0 ]]; then
    proc=${max_dump_processes}
    echo Waiting for background processes...
    # with >64 KiB of input, the generator in the process substitution is still
    # running and this wait hangs
    wait
    echo Background processes finished
  fi
done < <(generator $output_kb | cat)
if [[ proc -gt 0 && proc -lt ${max_dump_processes} ]]; then
  wait
fi

@jkaniuk (Contributor) commented Dec 4, 2019

Basically, wait expects the number of background processes to reach 0, but the kubectl process feeding the loop is still running in the background.
Previously that was not a problem, as the loop was executed in a separate subshell from kubectl.
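
A sketch of one way to avoid the deadlock (an assumption for illustration, not necessarily what #85858 does): collect the PIDs of the jobs started inside the loop and wait only on those, so wait never blocks on the kubectl producer behind the process substitution. The dump_node_logs helper is a stand-in for the real per-node log fetching in log-dump.sh.

#!/bin/bash
# Sketch only - not the actual fix merged in #85858.
max_dump_processes=25

# stand-in for the real per-node log fetching done by log-dump.sh
function dump_node_logs() {
  local pod=$1 node=$2
  kubectl logs -n logexporter "${pod}" > "/tmp/${node}-logexporter.log" 2>&1 || true
}

pids=()
while read -r pod node; do
  echo "Fetching logs from ${pod} running on ${node}"
  dump_node_logs "${pod}" "${node}" &
  pids+=("$!")
  if [[ "${#pids[@]}" -ge "${max_dump_processes}" ]]; then
    wait "${pids[@]}"   # waits only for our own jobs, not for the kubectl process
    pids=()
  fi
done < <(kubectl get pods -n logexporter -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}')
if [[ "${#pids[@]}" -gt 0 ]]; then
  wait "${pids[@]}"
fi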

jkaniuk added a commit to jkaniuk/kubernetes that referenced this issue Dec 4, 2019
Fix bug found by shellcheck in logexporter log fetching
where last wait was not working properly.
Fix DumpClusterLogs hanging in 5k nodes clusters:
kubernetes#85753

Change-Id: Id02bf9048b19e790940c7eac6d45d7fa7a3dfb2b