Tracking Issue: Two consistently failing upgrade tests: fix or remove? #69475

Closed
jberkus opened this Issue Oct 5, 2018 · 15 comments


jberkus commented Oct 5, 2018

Master-upgrade tests are supposed to be blocking for the release. However, two of the upgrade tests have been consistently failing for at least the last 3 weeks. At this point, we need to decide whether these are going to get fixed, or removed from the sig-release-master-upgrade dashboard as non-blocking. Possibly we should just stop running them.

Those tests are:

ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new
ci-kubernetes-e2e-gke-gci-new-gci-master-upgrade-cluster-new

Have these tests ever passed?

/kind failing-test
/sig test-infrastructure
/sig release
/priority important-soon

cc:

@cjwagner
@AishSundar
@mohammedzee1000


cjwagner commented Oct 5, 2018

The most recent pass for ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new was on 8/3: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1587/

The most recent pass for ci-kubernetes-e2e-gke-gci-new-gci-master-upgrade-cluster-new was on 8/6: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-new-gci-master-upgrade-cluster-new/2216/


jberkus commented Oct 5, 2018

Yeah, it looks like the 2nd test used to pass reasonably regularly, but those passes go back to the beginning of our recorded history for the job.


jberkus commented Oct 5, 2018

These tests appear to belong to sig-gcp, so cc'ing them here. I've already asked cluster-lifecycle, and these aren't the upgrade tests they maintain.

/sig gcp

k8s-ci-robot added the sig/gcp label Oct 5, 2018


AishSundar commented Oct 5, 2018

The gce test (1st one) used to pass earlier (around the 1.11 and early 1.12 timeframe), but the recent issue with these 2 jobs seems to be consistent timeouts. Looking at the triage dashboard, the timeouts seem to have been occurring earlier as well, but became consistent from mid-August.

/cc @roberthbailey @zmerlynn

@roberthbailey told me that @zmerlynn was going to start looking at the upgrade and downgrade test suites and understand exactly which upgrade flows we should be testing (and blocking on?). @zmerlynn, would you be able to shed some light on the timeouts here and whether these tests make sense to run and block k8s releases on?


jberkus commented Oct 16, 2018

These tests are still failing; increasing the timeout appears not to have helped at all.

See kubernetes/test-infra#9802

What now?


justinsb commented Oct 18, 2018

We've got signal :-)

Splitting into parallel tests & serial/disruptive tests gives us signal on each:


https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-cluster-new-parallel

That's broadly green except for the occasional flake on [sig-network] Services should have session affinity work for LoadBalancer service with ESIPP on [Slow] [DisabledForLargeClusters]


https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-cluster-new

This one takes longer, so we only have two runs so far, both of which flaked but on different tests 🎉

[sig-scheduling] SchedulerPriorities [Serial] Pod should be preferably scheduled to nodes which satisfy its limits
[sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability
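
For anyone reading along, here is a minimal sketch of what the parallel/serial split means for individual specs: the jobs select specs by the ginkgo-style tags embedded in the test names, with the parallel job skipping anything marked [Serial], [Slow], or [Disruptive] and the serial/disruptive job running exactly those. The tag list and helper below are illustrative assumptions, not the actual test-infra job configuration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative only: these tags mark specs that are assumed unsafe or too
// slow to run concurrently with other specs.
var serialTags = regexp.MustCompile(`\[(Serial|Slow|Disruptive)\]`)

// bucket says which of the two split jobs a spec name would land in:
// the parallel job skips tagged specs, the serial/disruptive job runs them.
func bucket(testName string) string {
	if serialTags.MatchString(testName) {
		return "serial"
	}
	return "parallel"
}

func main() {
	tests := []string{
		"[sig-network] Services should have session affinity work for LoadBalancer service with ESIPP on [Slow] [DisabledForLargeClusters]",
		"[sig-scheduling] SchedulerPriorities [Serial] Pod should be preferably scheduled to nodes which satisfy its limits",
		"[sig-storage] EmptyDir volumes should support (root,0644,tmpfs)",
	}
	for _, t := range tests {
		fmt.Printf("%-8s %s\n", bucket(t), t)
	}
}
```

The upshot is that a flake in a [Serial] or [Slow] spec no longer hides the signal from the much larger parallel suite, and vice versa.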


AishSundar commented Oct 18, 2018

This is cool, thanks @justinsb. Maybe we should split the tests out similarly for gce-new-master-upgrade-cluster-new as well?


AishSundar commented Oct 18, 2018

Ah, I just saw the relevant discussion in kubernetes/test-infra#9802 (comment)


justinsb commented Oct 19, 2018

Yes, thanks @AishSundar. Splitting gce-new-master-upgrade-cluster-new as well in kubernetes/test-infra#9864 (I thought there were more jobs to split, but I haven't found one yet).

Generally, I'm looking into why particular tests are failing. A lot of them seem to be failing because of the daemonset issue being tracked in #69356


AishSundar commented Oct 19, 2018

Thanks so much @justinsb for following through on this.


jberkus commented Oct 21, 2018

Well, the good news is that these tests are no longer timing out.

The bad news is that they're still failing. Filing new issues and keeping this one open as a tracking issue.

Related so far:
Issue #69989

jberkus changed the title from "Two consistently failing upgrade tests: fix or remove?" to "Tracking Issue: Two consistently failing upgrade tests: fix or remove?" Oct 21, 2018


AishSundar commented Oct 21, 2018

After all the splitting, I see gce-new-master-upgrade-cluster-new-parallel, which runs slow tests in parallel, timing out after 15 hrs similar to before. The others are all failing on miscellaneous tests, as @jberkus indicated.


AishSundar commented Oct 24, 2018

I think it's safe now to close this issue. @jberkus, I'll let you decide.


jberkus commented Oct 25, 2018

OK, we're down to a couple known issues causing these tests to fail, so there's no real value in a tracking issue.

/close


k8s-ci-robot commented Oct 25, 2018

@jberkus: Closing this issue.

In response to this:

OK, we're down to a couple known issues causing these tests to fail, so there's no real value in a tracking issue.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
