This repository has been archived by the owner on May 25, 2023. It is now read-only.

Fix e2e test: gang scheduling. #835

Closed
TommyLike opened this issue Apr 28, 2019 · 14 comments · Fixed by #848
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@TommyLike
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
After the revert of the batch job patches in #806, the gang scheduling test case is failing; this needs to be investigated and fixed.

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 28, 2019
@k82cn
Contributor

k82cn commented Apr 29, 2019

@thandayuthapani , please help to check what happened :)

@thandayuthapani
Contributor

@k82cn It is passing in my local test setup

@thandayuthapani
Contributor

I'm able to find the PodGroup Unschedulable event in my local test setup, but in Travis the test times out at this step because the event is either not found or not generated. I will check what the problem is.

@k82cn
Contributor

k82cn commented May 7, 2019

@thandayuthapani , we found a similar issue in Volcano, caused by some changes in the e2e tests; refer to volcano-sh/volcano@c53779d#diff-8349006db2c242fd7424e1dfb3295840R430 for more detail.

@thandayuthapani
Contributor

@k82cn The test case fails because the Unschedulable event is never generated, so the test times out waiting for it. The event is not generated because the fields of podGroupStatus in the PodGroup object are reset to their nil values when the status updater function is called.
[screenshot: export1]
Since the object returned from the API server is then assigned to our local cache object, the local cache also loses the status information. The Unschedulable event is generated only when the phase of the podGroup is Pending or Unknown.
[screenshot: export]
Attaching the kube-batch logs for better understanding. The status of the podGroup is determined in kube-batch; to update it, the status updater is called and the object it returns is assigned back to the cache. The returned object does not have the status data, so it is lost.
[screenshot: export (1)]

StatusUpdater uses the K8s Update API call; when that is replaced with an UpdateStatus call, the test case passes. There seems to be some problem with the Update call, where the status data gets reset to its nil value.
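
For illustration, here is a minimal sketch of the behaviour described above. It assumes the PodGroup resource is served with the status subresource enabled, in which case a write to the main resource ignores the status in the request and only a write to the status endpoint persists it. The thread does not confirm that this is the root cause. All types and the fake server below are hypothetical, simplified stand-ins, not the kube-batch or client-go code.

```go
package main

import "fmt"

// PodGroupStatus and PodGroup are simplified, hypothetical stand-ins
// for the real API types.
type PodGroupStatus struct {
	Phase string
}

type PodGroup struct {
	Name      string
	MinMember int32
	Status    PodGroupStatus
}

// fakeServer mimics (as an assumption) an API server that stores a
// resource whose status subresource is enabled.
type fakeServer struct {
	stored PodGroup
}

// Update mimics a write to the main resource: the status sent in the
// request is ignored and the previously stored status is kept.
func (s *fakeServer) Update(pg PodGroup) PodGroup {
	pg.Status = s.stored.Status
	s.stored = pg
	return s.stored
}

// UpdateStatus mimics a write to the status endpoint: only the status
// part of the request is persisted.
func (s *fakeServer) UpdateStatus(pg PodGroup) PodGroup {
	s.stored.Status = pg.Status
	return s.stored
}

// shouldRecordUnschedulable encodes the condition stated in the thread:
// the event is generated only when the phase is Pending or Unknown.
func shouldRecordUnschedulable(phase string) bool {
	return phase == "Pending" || phase == "Unknown"
}

func main() {
	srv := &fakeServer{stored: PodGroup{Name: "pg-1", MinMember: 3}}

	// The scheduler decides the phase locally, then calls the status updater.
	local := srv.stored
	local.Status.Phase = "Pending"

	// The returned object is what gets written back into the local cache.
	cached := srv.Update(local)
	fmt.Printf("after Update: phase=%q, event=%v\n",
		cached.Status.Phase, shouldRecordUnschedulable(cached.Status.Phase))

	cached = srv.UpdateStatus(local)
	fmt.Printf("after UpdateStatus: phase=%q, event=%v\n",
		cached.Status.Phase, shouldRecordUnschedulable(cached.Status.Phase))
}
```

Under these assumptions, the object returned by Update carries an empty phase, so the cache loses the status and the event condition is never met, while the object returned by UpdateStatus keeps the Pending phase.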

@k82cn
Contributor

k82cn commented May 7, 2019

StatusUpdater uses the K8s Update API call; when that is replaced with an UpdateStatus call, the test case passes.

Can you help check the history to see why it started failing recently?

@thandayuthapani
Contributor

thandayuthapani commented May 7, 2019

It used to pass with my local DIND v1.13 setup, but once I cleaned that setup and brought up a new DIND v1.13 setup, I started hitting the same problem as CI. I think the problem appears after DIND pulls new images for the Kubernetes components: the same code did not fail in my old DIND setup, but it fails in the new one, with no change in code. The DIND version did not change; only new images were pulled by the DIND clusters for the Kubernetes components.

@k82cn
Contributor

k82cn commented May 7, 2019

but volcano-sh/kube-batch seems fine without this fix.

@thandayuthapani
Contributor

That test case has been skipped in volcano/kube-batch

@k82cn
Contributor

k82cn commented May 7, 2019

xref https://travis-ci.com/volcano-sh/volcano/jobs/198124601 , it has been fixed in the commit that I mentioned above.

@thandayuthapani
Contributor

In volcano-sh/volcano we use KIND to bring up the k8s cluster, but in kubernetes-sigs/kube-batch we use DIND.

@k82cn
Contributor

k82cn commented May 7, 2019

then try kind; honestly, I'm uncomfortable with the fix if we do not know the root cause :)

@thandayuthapani
Contributor

thandayuthapani commented May 7, 2019

With a local KIND setup, the gang scheduling and statement test cases pass. Should I change kube-batch to use KIND in CI now?

@k82cn
Contributor

k82cn commented May 7, 2019

ok, let's try kind :)

4 participants