
Failing test: [sig-scheduling] SchedulerPriorities [Serial] Pod should be preferably scheduled to nodes which satisfy its limits #69989

Closed
mortent opened this Issue Oct 18, 2018 · 15 comments

Comments

AishSundar commented Oct 18, 2018

/assign RaviSantosh Gudimetla

k8s-ci-robot commented Oct 18, 2018

@AishSundar: GitHub didn't allow me to assign the following users: RaviSantosh, Gudimetla.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign RaviSantosh Gudimetla

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

AishSundar commented Oct 18, 2018

@ravisantoshgudimetla


ravisantoshgudimetla commented Oct 19, 2018

I looked at the kube-scheduler logs for GCE; it seems the feature gate was enabled:

Creating scheduler with fit predicates 'map[MaxCSIVolumeCountPred:{} GeneralPredicates:{} NoVolumeZoneConflict:{} MatchInterPodAffinity:{} CheckVolumeBinding:{} MaxAzureDiskVolumeCount:{} NoDiskConflict:{} CheckNodeUnschedulable:{} MaxEBSVolumeCount:{} MaxGCEPDVolumeCount:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{}]'

However, I don't see ResourceLimitsPriorityFunction enabled in that list, so I will set up a cluster locally to test whether this is the issue.
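
For context on why that log line matters: the ResourceLimits priority is only added to the scheduler's priority map when its feature gate is on, so an enabled gate should show up as an extra entry in the priority functions printed above. Below is a minimal, self-contained Go sketch of that pattern; it is only an illustration (the names featureGates and registerPriority are made up here, and the real kube-scheduler registration code is structured differently):

```go
package main

import "fmt"

// Illustrative sketch only -- not actual kube-scheduler code. It shows the
// pattern of adding a feature-gated priority (ResourceLimitsPriority) to the
// priority map only when its gate is on, which is why the priority is missing
// from the "Creating scheduler with ... priority functions" log line when the
// gate is off. featureGates and registerPriority are invented names.
var featureGates = map[string]bool{
	"ResourceLimitsPriorityFunction": false, // what the GCE log above suggests
}

var priorities = map[string]struct{}{
	"LeastRequestedPriority":      {},
	"BalancedResourceAllocation":  {},
	"NodeAffinityPriority":        {},
	"TaintTolerationPriority":     {},
	"ImageLocalityPriority":       {},
	"SelectorSpreadPriority":      {},
	"InterPodAffinityPriority":    {},
	"NodePreferAvoidPodsPriority": {},
}

func registerPriority(name, gate string) {
	// Register the priority only when its feature gate is enabled.
	if featureGates[gate] {
		priorities[name] = struct{}{}
	}
}

func main() {
	registerPriority("ResourceLimitsPriority", "ResourceLimitsPriorityFunction")
	_, enabled := priorities["ResourceLimitsPriority"]
	fmt.Println("ResourceLimitsPriority registered:", enabled) // prints false while the gate is off
}
```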


jberkus commented Oct 23, 2018

@ravisantoshgudimetla have you been able to look into this? These tests are broken right now for unrelated reasons, but once they get fixed I expect to keep seeing the scheduler failure.


AishSundar commented Oct 24, 2018

/milestone v1.13

k8s-ci-robot added this to the v1.13 milestone Oct 24, 2018


AishSundar commented Oct 30, 2018

@ravisantoshgudimetla is the one open PR, #70401, expected to fix the remaining failures we are seeing on the release dashboards? If this is not fixed by this week, these failures will block the Beta cut targeted for 11/6.


ravisantoshgudimetla commented Oct 31, 2018

@AishSundar - Apologies for the delay. Yes, this PR should fix the failure; if not, I will disable this test for now and revert the feature promotion PR associated with it. I also need to rethink whether e2e is an appropriate test scenario for this feature, or whether unit/integration tests suffice. I could not find a test in the existing e2e suite that does something like resizing a node.


justinsb commented Nov 1, 2018

I'm looking at this, and I'm not sure this is going to fix the failures.

Looking at the test, it sets the node status capacity/allocatable to pretend the node is a 12-core machine, and then it tries to allocate a pod on every node such that each node ends up 50% utilized.

It thus tries to allocate a pod of ~6 cores on the resized node. But that node still only has 2 cores, so the pod fails to schedule:

Oct 31 13:57:26.840: INFO: At 2018-10-31 13:47:26 +0000 UTC - event for 80a6890c-dd13-11e8-b8c2-0a580a3c135f-0: {kubelet bootstrap-e2e-minion-group-s6qw} OutOfcpu: Node didn't have enough resource: cpu, requested: 5250, used: 450, capacity: 2000

I think that's why the test has been failing, at least for the past few runs which I checked.

What I have yet to figure out is how this passed here... https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/1750
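
To make the arithmetic concrete, here is a tiny sketch of the mismatch. It assumes the test targets roughly 50% utilization of the reported capacity; only the 2000m capacity and 450m usage come from the event quoted above, and the real test's formula may differ slightly (which is why this computes 5550m rather than the 5250m in the event):

```go
package main

import "fmt"

// Back-of-the-envelope sketch of the mismatch described above: the test
// patches node status to report a 12-core machine and then sizes a pod so the
// node would be roughly 50% utilized. Numbers other than the 2000m capacity
// and 450m usage (from the OutOfcpu event) are illustrative assumptions.
func main() {
	const (
		reportedCapacityMilli = 12000 // faked capacity written into node status
		realCapacityMilli     = 2000  // actual capacity of the 2-core node
		alreadyUsedMilli      = 450   // existing usage on that node
	)

	// Target roughly 50% utilization of the *reported* capacity.
	request := reportedCapacityMilli/2 - alreadyUsedMilli

	fmt.Printf("computed pod request: %dm (the event showed 5250m requested)\n", request)
	fmt.Printf("fits on the real node? %v (capacity %dm)\n",
		request+alreadyUsedMilli <= realCapacityMilli, realCapacityMilli)
}
```

Any request derived from the fake 12-core capacity lands around 5-6 cores, which can never fit on the genuinely 2-core node.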


AishSundar commented Nov 1, 2018

@justinsb is it safe to characterize this as purely a test-design issue, then?


AishSundar commented Nov 2, 2018

@ravisantoshgudimetla the test still failed on the last run, #4295. Even though that run was a couple of hours after PR #70401 merged, I don't see the PR in the delta changes for that test run. I will watch for the next run's result.

Assuming it still doesn't pass (as @justinsb pointed out above), we need to figure out what this e2e test failure means and whether it should block 1.13 Beta0 for next Tuesday. Considering the test has passed very few times since its introduction, this seems like an issue with the test design itself and not the feature. Do you have other tests (unit, integration, or a different approach to e2e) that can inform us on the health of the feature? If those exist and are green, we can unblock Beta on those results while you determine the right way to author the e2e test. Barring those being present right now, I think we will have to err on the side of caution and roll back the feature itself before Beta. Either way, we need to take a call and submit the relevant PRs by tomorrow so we get green CI over the weekend, in time for the Beta cut on Tuesday.

@jberkus


ravisantoshgudimetla commented Nov 2, 2018

Assuming it still doesn't pass (as @justinsb pointed out above), we need to figure out what this e2e test failure means and whether it should block 1.13 Beta0 for next Tuesday.

This is not a blocker. As I said earlier, e2e may not be a good avenue for this test. I thought more about this problem yesterday. The problem is that the test updates the node capacity and then checks whether the node can handle the pod based on the scheduler cache, which is populated from the kubelet. But the kubelet, on its side, keeps updating its node information (CPU, memory, etc.) in the API server, thereby causing a race condition. So the test could only pass with very low probability, and I will remove it from e2e.

Do you have other tests (unit, integration, or a different approach to e2e) that can inform us on the health of the feature?

Yes, we have unit tests that help validate the feature. I will try to add an integration test, but that is not a blocker for the feature.
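
Going back to the race described above, a toy simulation may help illustrate why the test only passes occasionally: the patched capacity is overwritten whenever the kubelet's next status sync lands before the scheduler looks at the node. Everything below is invented for illustration (it is not client-go, kubelet, or scheduler code); the 2000m/12000m values are taken from the discussion above:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy simulation of the race: the e2e test writes a fake capacity into the
// node's status while the kubelet's periodic status sync writes the real
// capacity back, so what the scheduler sees depends on which write landed last.
type nodeStatus struct {
	mu        sync.Mutex
	cpuMillis int64
}

func (n *nodeStatus) set(v int64) { n.mu.Lock(); n.cpuMillis = v; n.mu.Unlock() }
func (n *nodeStatus) get() int64  { n.mu.Lock(); defer n.mu.Unlock(); return n.cpuMillis }

func main() {
	node := &nodeStatus{cpuMillis: 2000} // the real 2-core node

	// "kubelet": keeps reporting the real capacity on its sync loop.
	go func() {
		for {
			node.set(2000)
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// "e2e test": patches the status to pretend the node has 12 cores.
	node.set(12000)

	// "scheduler": a moment later, checks whether the oversized pod fits.
	time.Sleep(25 * time.Millisecond)
	request := int64(5550)
	fmt.Printf("capacity seen by scheduler: %dm; pod fits: %v\n",
		node.get(), request <= node.get())
}
```

In this sketch the test's write almost always loses to the kubelet's sync, which matches the observation that the e2e test could pass only by chance.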


AishSundar commented Nov 2, 2018

Thanks for the explanation and analysis, @ravisantoshgudimetla. Do you plan on adding more e2e tests later, or will you stick to unit and integration tests for validating this feature? @bsalamat, @jberkus, and @justinsb, please weigh in on the path forward, i.e., removing the e2e tests for this feature and relying on unit and integration tests.


ravisantoshgudimetla commented Nov 2, 2018

@AishSundar - We discussed this in the SIG meeting yesterday, and it was decided to have an integration test instead of an e2e test. I plan to add an integration test (which we don't have today) later.


AishSundar commented Nov 6, 2018

@ravisantoshgudimetla since this PR is actually tracking enhancement work graduating to Beta in 1.13, it should have been tracked via a corresponding issue in the k/enhancements repo in the 1.13 timeframe. We are now past Feature Freeze for 1.13, and it is late to open an issue against it. Also, we are missing the integration tests for this feature, which we are relying on solely for qualification.

Given the above two points, and based on offline discussions with SIG leads regarding the criticality of this feature going to Beta now vs. in the 1.14 timeframe, I am asking for it to be punted to 1.14.

As AIs

@kacole2 @bsalamat @davidopp
