Failing Test: [sig-scheduling] TaintBasedEvictions [Serial] Checks that the node becomes unreachable #70627

Closed
jberkus opened this Issue Nov 4, 2018 · 14 comments

jberkus commented Nov 4, 2018

Which jobs are failing: ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new

Which test(s) are failing: [sig-scheduling] TaintBasedEvictions [Serial] Checks that the node becomes unreachable

Since when has it been failing: Nov. 3rd

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new

Reason for failure:

Proximate error:

Nov 4 17:57:57.062: node "bootstrap-e2e-minion-group-cshn" doesn't turn to NotReady after 3 minutes

However, this test run has a whole bunch of disregarded RBAC errors, and I'm wondering if this is just a permissions problem. I'm filing an issue just so the scheduling folks take a look at it, in case it does represent a real upgrade problem.

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1772

Anything else we need to know:

This particular test is unreliable, and is likely to be taken out of blocking.

/kind failing-test
/sig scheduling
/sig cluster-lifecycle
/priority important-soon

Huang-Wei commented Nov 4, 2018

@jberkus The e2e test "TaintBasedEvictions [Serial] Checks that the node becomes unreachable" simulates blocking the network connection to the master like this: it SSHes into a worker node, issues an iptables command to block traffic to the master, and then waits for the worker node to reach NotReady status.
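
In code terms, that disruption step amounts to roughly the following (a sketch for illustration, not the actual test source; it assumes the e2e framework helpers BlockNetwork, UnblockNetwork, WaitForNodeToBeNotReady and Failf with roughly the signatures they have around this release):

import (
	"time"

	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// blockNodeFromMaster sketches the disruption step: host is the worker node's
// "<external-ip>:22" SSH endpoint, nodeName is its node object name.
func blockNodeFromMaster(c clientset.Interface, host, nodeName string) {
	masterAddr := framework.GetMasterAddress(c)

	// Always restore connectivity, even if the wait below fails.
	defer framework.UnblockNetwork(host, masterAddr)

	// Inserts an iptables DROP rule on the node (over SSH) for traffic to masterAddr.
	framework.BlockNetwork(host, masterAddr)

	// With heartbeats to the apiserver blocked, the node controller should
	// eventually mark the node NotReady.
	if !framework.WaitForNodeToBeNotReady(c, nodeName, 3*time.Minute) {
		framework.Failf("node %q doesn't turn to NotReady after 3 minutes", nodeName)
	}
}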

It passes in the regular sig-scheduling dashboard, so I'm not sure if there is something particular about the job "ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new". I will take a closer look.

Huang-Wei commented Nov 4, 2018

From the log, there is no error in the "SSH into the worker node and run the iptables command to block traffic to the master" step. I'm a little stuck on why this worker node doesn't show NotReady.

Huang-Wei commented Nov 5, 2018

One possibility is that in the upgrade testing env there is only one worker node, so the cluster enters a "fully disrupted" mode and the node lifecycle controller doesn't update the node to NotReady, per its design.

@jberkus may I know the output of kubectl get node -o wide in this "master upgrade" test env?

Edited: nvm, I noticed that there are 3 worker nodes.

Huang-Wei commented Nov 5, 2018

Another symptom I observed: in the regular e2e test env where this test passes, the log shows:

I1104 21:45:51.260] Nov  4 21:45:51.260: INFO: block network traffic from 104.155.153.70:22 to 104.154.244.35

But in the "master upgrade" env, it shows:

Nov  4 17:54:56.449: INFO: block network traffic from 104.198.207.223:22 to 35.193.37.31

It looks suspicious that the master IP is resolved to 35.193.37.31.

And the e2e test uses the same logic to retrieve the master IP:

// GetMasterAddress returns the hostname/external IP/internal IP as appropriate for e2e tests on a particular provider
// which is the address of the interface used for communication with the kubelet.
func GetMasterAddress(c clientset.Interface) string {
	master := getMaster(c)
	switch TestContext.Provider {
	case "gce", "gke":
		return master.externalIP
	case "aws":
		return awsMasterIP
	default:
		Failf("This test is not supported for provider %s and should be disabled", TestContext.Provider)
	}
	return ""
}


Edited: the master IP resolution is fine; the issue is that during the master upgrade, communication from the worker nodes to the master changed to use the internal IP.
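
In other words, the iptables rule drops traffic to the master's external IP, so once kubelet-to-apiserver traffic moves to the internal IP after the upgrade, nothing is actually blocked and the node never goes NotReady. A quick way to inspect which addresses are in play is something like the sketch below (illustrative only, using the client-go list call of this release; dumpNodeAddresses is a made-up helper name, not part of the test):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// dumpNodeAddresses (hypothetical helper) logs every node's reported addresses
// so InternalIP vs ExternalIP usage can be compared with the IP the test blocks.
func dumpNodeAddresses(c clientset.Interface) {
	nodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{})
	if err != nil {
		framework.Failf("listing nodes: %v", err)
	}
	for _, n := range nodes.Items {
		for _, addr := range n.Status.Addresses {
			// addr.Type is one of InternalIP, ExternalIP, Hostname, etc.
			framework.Logf("node %s: %s=%s", n.Name, addr.Type, addr.Address)
		}
	}
}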

jberkus commented Nov 5, 2018

Link to original PR: #69796

AishSundar commented Nov 5, 2018

@justinsb @roberthbailey to help answer any questions relating to upgrade test setup and env.

Huang-Wei commented Nov 5, 2018

I think the root cause is described in #56787:

After GCE cluster upgrade, the nodes talk to the master using the in-cluster IP.

  • From the latest test failure (master upgrade env), I do see apiserver logs showing it's communicating using the private IP.
  • I also checked the apiserver log of regular e2e runs; it's not using the private IP.

Kindly pinging @foxish: do we have a plan to fix #56787 in 1.13? If not, I think I will have to work around this problem by disabling this e2e test in the GCE env.

Or, is there a way to disable this only in the master upgrade env?
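
For reference, the coarse version of that workaround would just be a provider-based skip at the top of the spec, inside the existing Ginkgo It block with the usual dot-imported ginkgo and framework packages (a sketch of the workaround mentioned above, not something that was merged; it assumes framework.SkipIfProviderIs and keys off the provider, so limiting the skip to only the master upgrade job would instead need a --ginkgo.skip filter in that job's config):

// Hypothetical workaround, not what was ultimately done: skip the spec entirely on GCE/GKE.
It("Checks that the node becomes unreachable", func() {
	framework.SkipIfProviderIs("gce", "gke")
	// ... rest of the test unchanged ...
})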

Huang-Wei commented Nov 5, 2018

Also cc @bsalamat @k82cn @ravisantoshgudimetla for awareness.

AishSundar commented Nov 8, 2018

@justinsb I see that your PR #70681 merged as a fix for #56787. I see the test passing in the latest run https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new !! Thanks much.

Is there any more work needed @Huang-Wei or can we close this?

Huang-Wei commented Nov 8, 2018

The latest run of ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new is green (after #70681 got merged).

AishSundar commented Nov 8, 2018

Cool.. let's watch for a few more runs and then close this issue.

AishSundar commented Nov 11, 2018

Closing this issue now since the latest run was green. Thanks much @justinsb and @Huang-Wei for the quick turnarounds.

AishSundar commented Nov 11, 2018

/close

k8s-ci-robot commented Nov 11, 2018

@AishSundar: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
