Failing Test: persistent-volume-upgrade [sig-storage] #68899
Looks like some permissions issue after the upgrade. |
@davidz627 has triaged this: the failures happen because the pod is scheduled on the master, and getting logs through the method the test uses is disabled on master nodes. This makes me think it is similar to #68881 (review), and that we have non-release-blocking issues where test cases are not quite robust enough in the face of node taints. |
It is interesting that in some places this test always fails, in some it always succeeds, and in one it even flakes, though with a high success rate: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster |
Adding on to @tpepper's comment: The Pod Spec in question:
Potentially relevant scheduler logs.
|
The fix mentioned in #68881 is only handling DaemonSets. I don't think it will fix this issue. |
Do we need to get sig-scheduling involved to see why the master is showing up as a choice for this pod? |
/sig scheduling |
It seems that this pvc-tester- pod is scheduled onto the master in every run, which causes the test to fail consistently.
|
@janetkuo Whereas in this one it doesn't seem to fail at all!: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gke-gci-new-gci-master-upgrade-cluster-parallel I talked to @krzyzacy about why the behavior might be different across different test suites but we were unable to come up with anything concrete. |
@davidz627 Is the master node in your test environment tainted properly? |
When we were doing unrelated development with DaemonSets, we noticed a similar issue where our DaemonSet pod would get scheduled onto the master in a GCE environment brought up with hack/cluster-up.sh; the masters brought up by that script were not tainted |
In what release was the master-tainting logic added? |
Not sure if this is a source of truth for the cluster, but the test artifacts here on a successful run show that the master node doesn't have a taint, and on a failed run there is ALSO no taint. But I'm not really sure where this |
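For reference, a properly tainted master would show something like this in its node object (a minimal sketch; the node name is illustrative, and the taint key is the master taint used in this era):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: e2e-test-master          # illustrative node name
spec:
  taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
```

If spec.taints is empty on the master, nothing prevents the scheduler from placing the pvc-tester pod there.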
i.e. #68492 |
Don't think #68492 will help. It looks specific to GPU labeling |
Also, ^^ is GKE-specific, and the GKE docs say taints are not inherited by a node pool from the cluster, so this shouldn't be something coming from that direction. |
Some log snippets around the area of interest:
|
/priority critical-urgent |
I think the fix is obvious: we need to taint the master node in the GCE master setup in cluster/. What I still don't understand is how we got this far without anything else failing, and why only this one test? |
Do taints persist across master upgrades, or do they need to be re-applied? |
cc @dashpole for what we can do on the node side. |
@kubernetes/sig-node-bugs help please |
@bsalamat regarding your example "a pod that tolerates certain node conditions might not be scheduled if the scheduler checked both node taints and node condition", could this be an API compatibility issue? Will a pod that tolerates a node condition in 1.11 fail to schedule in 1.12 because the scheduler no longer looks at node conditions and only node taints? |
@msau42 Pods were not able to tolerate node conditions by setting some API fields. We had some annotations, such as pod critical annotation and other hard-coded logic in DS controller, etc. that would have the same effect of tolerating conditions, but they were not in our API. |
Thanks for the explanation. It looks like the critical pod annotation was alpha, so does not have a backwards compatibility requirement, and we've fixed all the known users of it. |
xref: #63897 - cc @mikedanese @liggitt Should this be handled by the node controller? I.e. if the Unschedulable bool is set on the node, add the unschedulable taint? |
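For illustration, a minimal sketch of what that translation looks like on the node object (node name is illustrative; the taint is the one added when the Unschedulable field is set):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: cordoned-node            # illustrative node name
spec:
  unschedulable: true            # e.g. set by kubectl cordon or an upgrade script
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule           # the taint the controller is expected to add
```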
I don't believe the kubelet has the ability to update its taints anymore. Assuming this is the case, this seems like a version skew problem between master components. I.e., the 1.11 controller manager isn't compatible with the 1.13 scheduler because it doesn't add taints that the scheduler requires. This means that if the scheduler is upgraded before the controller manager (or before it has the chance to perform updates), we can run into issues. I think making the scheduler respect conditions in 1.13, and thus making it compatible with the 1.11 controller manager, seems like the best remedy to this. Regardless of this issue, user pods would need to include both the critical annotation as well as the toleration in order to maintain the same behavior across releases that transition from conditions to taints. |
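A sketch of what that would mean for a user pod, carrying both the (alpha) critical-pod annotation and explicit tolerations for the condition taints (pod name, image, and the exact toleration list are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: user-critical-pod                            # illustrative name
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""   # honored by releases that check conditions
spec:
  tolerations:                                       # honored by releases that check taints
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  containers:
  - name: app
    image: example.com/app:1.0                       # illustrative image
```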
Oh, I see that the node controller is already doing that, but there's a race condition (#68899 (comment)), sorry I missed that earlier. |
Yes, translation of conditions and unschedulable field to taints is done by the controller |
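Concretely, a node whose Ready condition is False ends up with the matching taint added by the controller, roughly like this (a minimal sketch; node name is illustrative):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1               # illustrative node name
spec:
  taints:
  - key: node.kubernetes.io/not-ready
    effect: NoSchedule         # added by the controller for the Ready=False condition
status:
  conditions:
  - type: Ready
    status: "False"
```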
I wonder if we can change our upgrade flow to ensure that the controller manager is upgraded and becomes stable (adds the taints) before the scheduler is upgraded. |
The kubelet has never required the ability to update taints.
Was that a typo, supposed to be 1.12? Have we ever supported two version skew between those components?
Under what circumstances do we upgrade to 1.13 scheduler while a 1.11 controller manager is still running?
Same question about two version skew between these two components.
That's more concerning… is that something all user pods would be exposed to and required to change? |
+1 to @dashpole's suggestion in #68899 (comment) to unblock the release |
What about an HA master upgrade? The scheduler and controller-manager each have independent leader elections, so during an HA rolling upgrade, the current leaders for the scheduler and controller manager could be on different releases in either direction |
I wouldn't expect an upgrade from 1.12 to 1.13 to begin until the upgrade from 1.11 to 1.12 was complete/successful |
Thanks for correcting me. It wasn't a typo, but just a lack of knowledge about how multi-version upgrades are performed. In that case we need to support 1 version skew.
Yeah, it should be 1.12, not 1.13. But generally, the newer scheduler should still be compatible with the state left by the previous version's controller manager, even if it isn't running anymore.
Actually, I agree. @bsalamat have we officially deprecated the critical pod annotation? We should ideally add the correct tolerations to pods with the critical pod annotation to ensure backwards compatibility. |
Bobby commented here that the system pods have already been corrected. I also noted that the critical pod annotation is alpha. |
I mean doing an HA rolling upgrade from 1.11 to 1.12. Let's say we have 3 masters, A, B, C, all on 1.11.
|
That feature (critical pods) was an alpha feature and we have already replaced it. The dilemma is that I am not so sure if critical pods were the only kind of pods with a "hack" for tolerating node conditions. Let me dig a bit more. |
One version of skew, absolutely. @dashpole's comment made me concerned two versions of skew support were being claimed/proposed |
I looked at the scheduler code and the DaemonSet controller code to see whether adding "CheckNodeCondition" could cause issues in the presence of TaintNodesByCondition, and I didn't find anything particularly concerning. So I created #68953. Let's see if it breaks tests. |
That will break scheduling of DaemonSets by the scheduler, and the kubelet's --register-with-taints does not help in the upgrade case. |
I suspected that it would conflict with pods which now have tolerations and the scheduler should ignore node conditions for, but I thought it would be limited to critical pods. Thanks for pointing it out. I closed the PR. |
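For context on the pods that now have tolerations: the DaemonSet controller adds taint tolerations to its pods automatically, roughly the following set (a sketch; the exact list depends on the release):

```yaml
# Approximate tolerations added by the DaemonSet controller to its pods.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```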
Unschedulable is translated to a taint by the controller & kubelet; #63955 taints the node with unschedulable to avoid a race condition between the controller and the scheduler. It depends on
Let me think about it; I'll propose one today. |
The race that is important to resolve at creation time is that of a new node object appearing untainted. That race does not apply in an upgrade scenario, where the node object already exists, and can already be observed and acted upon. Relying on the condition-to-taint controller in upgrade scenarios is reasonable. |
The race after kubelet startup is well known, but we did not handle the startup case well. I think what we're doing is to deprecate |
Failing Job
sig-release-master-upgrade#gce-new-master-upgrade-master
Gubernator Logs
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-master/1778
Error
Triage
This test is a persistent failure in this one test bucket only, and aside from one flake it appears to consistently pass everywhere else in sig-release-master-upgrade and in sig-release-1.12-all.
/kind bug
/priority failing-test
/sig storage
@kubernetes/sig-storage-bugs
/assign @msau42
/cc @saad-ali @childsb
/milestone v1.12