[Failing test] diffResources in master-upgrade #74890

Closed
mariantalla opened this Issue Mar 4, 2019 · 39 comments

mariantalla commented Mar 4, 2019

Which jobs are failing:

  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new
  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster

Which test(s) are failing:
diffResources

Since when has it been failing:
2019-03-01

Testgrid link:

Reason for failure:

```
Error: 12 leaked resources
+NAME                                                            ZONE        SIZE_GB  TYPE         STATUS
+bootstrap-e2e-78e337f9-3d2f-11e9-a656-2acb4231a6ed              us-west1-b  2        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-796bbc5d-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-85e2b93a-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-97d1f1dd-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
```

Anything else we need to know:
Possibly related: #74417 (a recent diffResources test failure) and #74887 (which started failing in the same jobs at the same time).

/sig storage
/sig testing
/kind failing-test
/priority critical-urgent
/milestone v1.14

cc @smourapina @alejandrox1 @kacole2 @mortent
cc @kubernetes/sig-storage-test-failures

msau42 commented Mar 5, 2019

In https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster/2826/build-log.txt, I see that the first thing the test does is upgrade the cluster, and that failed:

```
I0305 04:24:27.797] fatal error: sync: inconsistent mutex state
I0305 04:24:27.887] Test Suite Failed
W0305 04:24:27.988] !!! Error in ./hack/ginkgo-e2e.sh:143
W0305 04:24:27.989]   Error in ./hack/ginkgo-e2e.sh:143. '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --ginkgo.flakeAttempts="${FLAKE_ATTEMPTS}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-region="${REGION:-}" --gce-multizone="${MULTIZONE:-false}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --cloud-config-file="${CLOUD_CONFIG:-}" --repo-root="${KUBE_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" --network="${KUBE_GCE_NETWORK:-${KUBE_GKE_NETWORK:-e2e}}" --node-tag="${NODE_TAG:-}" --master-tag="${MASTER_TAG:-}" --cluster-monitoring-mode="${KUBE_ENABLE_CLUSTER_MONITORING:-standalone}" --prometheus-monitoring="${KUBE_ENABLE_PROMETHEUS_MONITORING:-false}" --dns-domain="${KUBE_DNS_DOMAIN:-cluster.local}" --ginkgo.slowSpecThreshold="${GINKGO_SLOW_SPEC_THRESHOLD:-300}" ${KUBE_CONTAINER_RUNTIME:+"--container-runtime=${KUBE_CONTAINER_RUNTIME}"} ${MASTER_OS_DISTRIBUTION:+"--master-os-distro=${MASTER_OS_DISTRIBUTION}"} ${NODE_OS_DISTRIBUTION:+"--node-os-distro=${NODE_OS_DISTRIBUTION}"} ${NUM_NODES:+"--num-nodes=${NUM_NODES}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} "${@:-}"' exited with status 1
```

I think the aborted upgrade test is what caused the resources to leak. For example, here's one leaked resource:

+bootstrap-e2e-dynamic-pvc-8b28e181-3efb-11e9-ad32-42010a8a0002  us-west1-b  1        pd-standard  READY

In the job, it is used by a pod that's part of the upgrade test:

I0305 14:00:17.224541       1 pv_controller.go:512] synchronizing PersistentVolume[pvc-8b28e181-3efb-11e9-ad32-42010a8a0002]: phase: Bound, bound to: "sig-apps-statefulset-upgrade-7984/datadir-ss-0 (uid: 8b28e181-3efb-11e9-ad32-42010a8a0002)", boundByController: true

msau42 commented Mar 5, 2019

@kubernetes/sig-cluster-lifecycle to triage the failing upgrade test.

/remove-sig storage

timothysc commented Mar 6, 2019

I didn't dig too deep, but these are Google's tests, not the community's.

/assign @justinsb

@spiffxp - given that Robert used to triage the inbound issues, who is the Google owner for /cluster?

timothysc commented Mar 6, 2019

fwiw it looks like there are some @kubernetes/sig-network-misc failures too.

timothysc commented Mar 6, 2019

@MrHohn

MrHohn commented Mar 6, 2019

> fwiw it looks like there are some @kubernetes/sig-network-misc failures too.

Ref #74887 (comment): the firewall test failure is also caused by UpgradeTest unexpectedly hitting an error and exiting in the middle, as described in #74890 (comment).

@mariantalla mariantalla moved this from New (no response yet) to Under investigation in 1.15 CI Signal Mar 6, 2019

kacole2 commented Mar 8, 2019

neolit123 commented Mar 9, 2019

> I0305 04:24:27.797] fatal error: sync: inconsistent mutex state

https://github.com/kubernetes/kubernetes/blob/master/test/e2e/lifecycle/cluster_upgrade.go#L435-L455

I would try this, to see what happens with bare primitives only vs. the goroutine mutex-wakeup problem:

```go
...
var readyLock int32
...
ready := func() {
	if atomic.LoadInt32(&readyLock) == 0 {
		atomic.StoreInt32(&readyLock, 1)
		sem.Ready()
	}
}
...
```

justinsb commented Mar 12, 2019

@neolit123 great find! I'm trying to figure out why that is panicking - do you have any links for the "goroutine mutex wakeup problem" you referenced?

neolit123 commented Mar 12, 2019

@justinsb

I tried looking for where the error string comes from and ended up here:
https://github.com/golang/go/blob/release-branch.go1.12/src/sync/mutex.go#L143

also:
https://github.com/golang/go/blob/release-branch.go1.12/src/sync/mutex.go#L42

So a goroutine was woken and attempts to acquire the lock, but it fails and the mutex enters starvation mode.
The FIFO waiters are supposed to set the mutexLocked bit after wakeup, but this goroutine has reached an undefined state where the mutex is both starving and the lock bit is not set.

The cause of the "throw" could also be process resource exhaustion, but I don't know why that would happen. On a quick look the upgrade framework doesn't seem to be doing anything wrong, but this needs debugging.

My proposal with the primitives above was to completely bypass the Go mutex fairness logic and observe how the behavior changes with the bare intrinsics.

I wonder if the Go 1.12 update changed something here, or if this was flaking with 1.11 too.
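For context, here is a minimal, self-contained sketch of the sync.Once-guarded ready() pattern at issue (the semaphore type and channel are illustrative stand-ins, not the actual upgrade-framework code linked above). Once.Do locks an internal sync.Mutex on its slow path, which is where the runtime throw above comes from:

```go
package main

import "sync"

// Illustrative stand-in for the upgrade framework's semaphore.
type semaphore struct{ ready chan struct{} }

func (s *semaphore) Ready() { close(s.ready) }

func main() {
	sem := &semaphore{ready: make(chan struct{})}

	var once sync.Once
	// ready may be called from several test goroutines; sync.Once guarantees
	// sem.Ready() runs exactly once, but Once.Do locks a sync.Mutex on its
	// slow path, which is where "sync: inconsistent mutex state" is thrown.
	ready := func() { once.Do(sem.Ready) }

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); ready() }()
	}
	wg.Wait()

	<-sem.ready // unblocks because Ready() was called exactly once
}
```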

justinsb commented Mar 12, 2019

Great analysis @neolit123 ... I agree that something is weird, and it seems like the 1.12 update is probably involved. Re-creating sync.Once without a mutex is a bit tricky, but your suggestion looks right (though it should probably use atomic.CompareAndSwapUint32 from sync/atomic). I can send a PR, unless you want to?
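A minimal sketch of what such a compare-and-swap variant could look like (names and the printed placeholder are illustrative; this is not the actual change that later went into #75305):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var readyFlag uint32

	// Only the goroutine that wins the CAS (flips readyFlag from 0 to 1)
	// executes the body, so it runs at most once without any sync.Mutex.
	ready := func() {
		if atomic.CompareAndSwapUint32(&readyFlag, 0, 1) {
			fmt.Println("sem.Ready() would be called here")
		}
	}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); ready() }()
	}
	wg.Wait()
}
```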

neolit123 commented Mar 12, 2019

I really don't have a setup to debug this failure properly.
Edit: also, please note that I want to see what happens if we remove sync.Once, but it's not a good solution; it's just a test.

Any chance we can get the attention of krousey on this, as he is the author of the upgrade framework logic?

justinsb commented Mar 12, 2019

I did ping @krousey and (I think) he concurs. Your analysis was much deeper than what I did, @neolit123 :-) I was thinking it was something to do with copying the mutex somehow.

What I propose: we get a compare-and-swap in there (I'll send a PR), and then we get the attention of the Go team, since it would look like a 1.12 regression if the CAS does fix it.

I haven't been able to reproduce the problem in isolation.

justinsb added a commit to justinsb/kubernetes that referenced this issue Mar 12, 2019

Speculative workaround for kubernetes#74890
We try using an atomic with a CAS, as a potential workaround for
issue kubernetes#74890.

Kudos to @neolit123 for the investigation & idea.

This is a speculative workaround - we are really seeing if this is
better; if not we should revert.

If it is better we should file a bug against go 1.12, and then revert!

Issue kubernetes#74890

justinsb commented Mar 12, 2019

The CAS idea is in #75305. I don't have a better idea than trying it and seeing if it works...

justinsb commented Mar 13, 2019

Great find on the (likely) underlying issue, @dims. And I think you're right, @neolit123, that even the workaround will not work, because it'll (probably) hit the same bug. I suspect it might not panic, though, so it might actually end up working "accidentally" - we'll see.

But regardless, it feels like we have a root cause. We can try your workaround, @neolit123, if we think it'll get us test signal back faster than switching Go releases.

spiffxp commented Mar 13, 2019

@bradfitz to give some context, we are currently scheduled to lift our code freeze by Tuesday March 19th, and stop accepting changes by Thursday March 21st (https://github.com/kubernetes/sig-release/tree/master/releases/release-1.14#timeline)

I'm willing to push out code freeze for a day or two for this, but if we can't try Go 1.12.1 by the 19th, my gut says we need to revert. It would really make our lives easier if this happened this week.

neolit123 commented Mar 13, 2019

The last two runs with this patch (#75305) don't seem to have the mutex failure:
https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-cluster
See runs ..42 and ..43.

But 1.12.1 is still ideal, as noted here: #74890 (comment)

Something we are not 100% sure about yet is whether this is the actual bug we are dealing with, or a bug in the e2e upgrade framework. #74890 (comment) will confirm it, but I'd leave the call on what action to take to others.

bradfitz commented Mar 13, 2019

We're aiming for a Go 1.12.1 release soon: tonight at the earliest, but likely tomorrow, or worst case Friday if there are surprises.

dims commented Mar 13, 2019

@bradfitz you all rock!

justinsb commented Mar 14, 2019

Looks like the hackaround in #75305 did "work", in that UpgradeTest is no longer panicking and we are getting test signal again. The locking is probably still incorrect and probably still hits the same bug - we just aren't panicking - but we have test signal!

Results look not-terrible - some flakes on ESIPP and storage, but not on the same test in the (very small) sample so far. And kudos to whoever had the good idea to trigger multiple parallel runs of this test so we got more results faster!

I believe this does suggest that we were hitting the deferred closure GC issue.

neolit123 commented Mar 14, 2019

> Results look not-terrible

Indeed, and the diffResources test passes now.

> I believe this does suggest that we were hitting the deferred closure GC issue.

Yes, once we revert our hack patch after a 1.12.1 bump, we will be sure.

@alejandrox1 alejandrox1 moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 14, 2019

spiffxp commented Mar 14, 2019

I opened #75372 to cover the steps to roll forward to Go 1.12.1. Who wants to take the lead on pushing it forward? I took a guess at the required steps in the description; please comment if I got anything wrong.

I feel like we should also have an issue open to plan out what a rollback to Go 1.11.5 would look like, just in case, because I'd like to understand how much time needs to be budgeted for that and work back to a decision point.

bradfitz commented Mar 14, 2019

Go 1.12.1 is out: https://golang.org/dl/

(And a Go 1.11.x update.)

neolit123 added a commit to neolit123/kubernetes that referenced this issue Mar 15, 2019

dims commented Mar 15, 2019

Thanks @bradfitz !

mariantalla commented Mar 15, 2019

#75372 is expected to resolve this issue.

@mariantalla mariantalla moved this from Open PR-wait for >5 successes before "Resolved" to Under investigation (prioritized) in 1.15 CI Signal Mar 15, 2019

Sajmani commented Mar 16, 2019

ipolyzos pushed a commit to ipolyzos/kubernetes that referenced this issue Mar 16, 2019

Merge pull request kubernetes#75393 from neolit123/revert-75305-workaround_once_mutex_issue

Revert "Speculative workaround for kubernetes#74890"

@spiffxp spiffxp moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 16, 2019

spiffxp commented Mar 16, 2019

It's looking like we've addressed this; will wait and see what further runs over the weekend bring in.
ref: #75372 (comment)
ref: #75372 (comment)

spiffxp commented Mar 18, 2019

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=gce.*upgrade&test=diffResources

Here we can see that occurrences of diffResources failures for /gce.*upgrade/ jobs increased after March 1st, when Go 1.12 was introduced:
[screenshot: triage chart showing the increase in failures]

Here we can see the occurrences drop off sharply after our workaround was introduced, and stay low once the workaround was removed and Go 1.12.1 was introduced:
[screenshot: triage chart showing the drop-off]

Calling this resolved.

spiffxp commented Mar 18, 2019

/close

k8s-ci-robot commented Mar 18, 2019

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp spiffxp moved this from Open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 11) in 1.15 CI Signal Mar 18, 2019

@spiffxp spiffxp moved this from Resolved (week Mar 11) to Resolved (week Mar 18) in 1.15 CI Signal Mar 18, 2019
