[job failure] gce-master-1.8-downgrade-cluster-parallel #56879

Closed
spiffxp opened this Issue Dec 6, 2017 · 16 comments

@spiffxp
Member

spiffxp commented Dec 6, 2017

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
@kubernetes/sig-cluster-lifecycle-test-failures

This job has been failing since at least 2017-11-21. It's on the sig-release-master-upgrade dashboard
and prevents us from cutting v1.9.0-beta.2 (kubernetes/sig-release#39). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster-parallel

kubetest --timeout triggered
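
For context, "kubetest --timeout triggered" means kubetest aborted the run after its --timeout deadline expired, rather than the suite finishing on its own. A rough, hypothetical invocation (flag values are illustrative, not this job's actual config):

# Sketch only: if the run exceeds --timeout, kubetest tears it down and the
# failure shows up on testgrid as "kubetest --timeout triggered".
kubetest --test --test_args='--ginkgo.focus=\[Feature:ClusterDowngrade\]' --timeout=300m
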
@spiffxp
Member

spiffxp commented Dec 11, 2017

Now tracking against v1.9.0 (kubernetes/sig-release#40)

All automated downgrade jobs are failing; this could really use some attention.

Maybe the same issue as #56244?

@krousey
Member

krousey commented Dec 12, 2017

I think I've fixed issues with the non-parallel one (both node and master downgrade failures), but this seems weird. I think there's an error in how it's configured.

From the normal downgrade (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/178?log#log):

W1211 12:18:26.502] 2017/12/11 12:18:26 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1211 12:18:26.506] Project: kubernetes-es-logging
W1211 12:18:26.506] Network Project: kubernetes-es-logging
W1211 12:18:26.506] Zone: us-central1-f
W1211 12:18:26.507] Trying to find master named 'bootstrap-e2e-master'
W1211 12:18:26.507] Looking for address 'bootstrap-e2e-master-ip'
I1211 12:18:26.608] Setting up for KUBERNETES_PROVIDER="gce".
W1211 12:18:27.388] Using master: bootstrap-e2e-master (external IP: 35.225.8.199)
I1211 12:18:28.652] Dec 11 12:18:28.652: INFO: Overriding default scale value of zero to 1
I1211 12:18:28.653] Dec 11 12:18:28.652: INFO: Overriding default milliseconds value of zero to 5000
I1211 12:18:28.777] I1211 12:18:28.776762    5867 e2e.go:384] Starting e2e run "64fefedf-de6d-11e7-9b62-0a580a3d0e17" on Ginkgo node 1
I1211 12:18:28.803] Running Suite: Kubernetes e2e suite
I1211 12:18:28.804] ===================================
I1211 12:18:28.804] Random Seed: 1512994707 - Will randomize all specs
I1211 12:18:28.804] Will run 1 of 699 specs

From this job's log (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/893?log#log):

W1212 01:41:20.197] 2017/12/12 01:41:20 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1212 01:41:20.199] Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Network Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Zone: us-central1-f
W1212 01:41:20.200] Trying to find master named 'bootstrap-e2e-master'
W1212 01:41:20.200] Looking for address 'bootstrap-e2e-master-ip'
I1212 01:41:20.301] Setting up for KUBERNETES_PROVIDER="gce".
W1212 01:41:21.064] Using master: bootstrap-e2e-master (external IP: 35.202.181.15)
I1212 01:41:24.401] Running Suite: Kubernetes e2e suite
I1212 01:41:24.401] ===================================
I1212 01:41:24.402] Random Seed: 1513042881 - Will randomize all specs
I1212 01:41:24.403] Will run 699 specs

What worries me is the last line. For some reason, this is running every e2e test we have, which just won't work.

edit: config is here https://github.com/kubernetes/test-infra/blob/master/jobs/config.json#L2906
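
For anyone else comparing the two logs: Ginkgo's --ginkgo.focus value is a regular expression matched against each spec's full description, which is why the serial job selects only 1 of 699 specs. A minimal sketch using two hypothetical spec descriptions:

# Two made-up spec descriptions standing in for the ~699 in the real suite.
cat <<'EOF' > /tmp/specs.txt
[sig-cluster-lifecycle] Downgrade [Feature:ClusterDowngrade] cluster downgrade should maintain a functioning cluster
[sig-network] Services should serve a basic endpoint from pods
EOF

# The same pattern the job passes via --ginkgo.focus; only the downgrade spec matches.
grep -E '\[Feature:ClusterDowngrade\]' /tmp/specs.txt
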

@k8s-merge-robot
Contributor

k8s-merge-robot commented Dec 12, 2017

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@spiffxp @kubernetes/sig-cluster-lifecycle-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required

Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
@enisoc
Member

enisoc commented Dec 12, 2017

@BenTheElder any ideas on the above? -^

@krousey
Member

krousey commented Dec 12, 2017

This was a wild goose chase. That message doesn't mean it's running all the specs... it's just that the reporting is slightly different for parallel runs... I think.

@BenTheElder
Member

BenTheElder commented Dec 12, 2017

ACK; meetings all morning, catching up on these things now. I think this probably was caused by flipping on parallel, actually. @krzyzacy, can you confirm?

@BenTheElder
Member

BenTheElder commented Dec 12, 2017

We've rolled out a change (@krousey wrote it, I just deployed it) that hopefully will be safe and will flip these to not run in parallel. It should take effect on any future runs.

@krousey
Member

krousey commented Dec 12, 2017

Just to clarify @BenTheElder's update: the downgrade step won't run in parallel. The tests that follow will still honor the parallel flag.
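
Roughly speaking (a sketch only, not the actual test-infra change; this assumes the GINKGO_PARALLEL environment variable that hack/ginkgo-e2e.sh reads, and the skip pattern is illustrative), the job should now behave like:

# Phase 1: the downgrade itself, on a single Ginkgo node.
GINKGO_PARALLEL=n ./hack/ginkgo-e2e.sh --ginkgo.focus='\[Feature:ClusterDowngrade\]' --upgrade-target=ci/k8s-stable1

# Phase 2: the regular e2e tests against the downgraded cluster, still parallel.
GINKGO_PARALLEL=y ./hack/ginkgo-e2e.sh --ginkgo.skip='\[Serial\]|\[Disruptive\]|\[Flaky\]'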

@krousey
Member

krousey commented Dec 13, 2017

OK, from the new logs I can see that the parallel and non-parallel jobs are getting hung at the same points now. The logs also helped me quickly debug that my latest fix wasn't sufficient for the test environment.

@krzyzacy
Member

krzyzacy commented Dec 13, 2017

thanks @krousey!

@krousey
Member

krousey commented Dec 13, 2017

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/904 successfully downgraded. Also, all tests passed. If this continues overnight, I say we close this issue.

@xiangpengzhao
Member

xiangpengzhao commented Dec 13, 2017

@krousey awesome!
We should also wait for https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster to turn green. But I believe it will :)

@krousey
Member

krousey commented Dec 13, 2017

@xiangpengzhao
Member

xiangpengzhao commented Dec 13, 2017

SGTM :)

@spiffxp
Member

spiffxp commented Dec 13, 2017

/close
OK, I've seen a few successful downgrades, and here's a full green run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/910

Thank you all
