Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rent Boskos project only once per test run. #405

Merged
merged 1 commit into from Jan 7, 2020

Conversation

xueweiz
Copy link
Contributor

@xueweiz xueweiz commented Jan 3, 2020

The e2e tests are running in parallel in different Ginkgo nodes (i.e. separate processes).

The old implementation is renting a Boskos project on each Ginkgo node. This is actually a waste of resource, because we already have VM-level isolation between test cases, we don't need project-level isolation. That could exhaust Boskos project pool.

The new implementation uses the SynchronizedBeforeSuite() and SynchronizedAfterSuite() functions to help. See the Managing Singleton External Processes in Parallel Test Suites subsection in the Ginkgo documentation for parallel tests.

Basically rentBoskosProjectIfNeededOnNode1() is called only once on Ginkgo node 1. It will rent a GCP project from Boskos, then pass the project name to all Ginkgo nodes. It will also spawn a goroutine to keep renewing the rented project.
The other Gingko nodes waits for node 1 to rent the project, and will then accept the project using the acceptBoskosProjectIfNeededFromNode1() function.
In the end, Gingko node 1 will wait for all other nodes to finish, then call the releaseBoskosResourcesOnNode1() function which releases the rented project.


Here is the logs from a sample run:

GO111MODULE=on ginkgo -nodes=3 -mod vendor -timeout=10m -v -tags " journald" -stream \
./test/e2e/metriconly/... -- \
-project= -zone=us-central1-a \
-image= -image-family=cos-73-lts -image-project=cos-cloud \
-ssh-user=prow -ssh-key=/etc/ssh-key-secret/ssh-private \
-npd-build-tar=`pwd`/node-problem-detector-v0.8.0-38-gc7efa05.tar.gz \
-boskos-project-type=gce-project -job-name=ci-npd-e2e-test \
-artifacts-dir=/logs/artifacts
...
[1] Parallel test node 1/3.
...
[1] Renting project from Boskos
...
[3] Parallel test node 3/3.
...
[2] Parallel test node 2/3.
...
[1] Rented project "k8s-boskos-gce-project-04" from Boskos
[1] Received Boskos project "k8s-boskos-gce-project-04" from Ginkgo node 1.
...
[3] Received Boskos project "k8s-boskos-gce-project-04" from Ginkgo node 1.
...
[2] Received Boskos project "k8s-boskos-gce-project-04" from Ginkgo node 1.
...
[2] PASS
...
[3] PASS
[1] Releasing all Boskos resources.
...
[1] PASS

Most importantly, note that only Ginkgo node 1 rented the project, and all other nodes simple accepts it from node 1. And that the resource is only released after all nodes have finished.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 3, 2020
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 3, 2020
The old implementation rents Boskos project for each Ginkgo node.
@xueweiz
Copy link
Contributor Author

xueweiz commented Jan 4, 2020

/assign @wangzhen127
/cc @andyxning

@wangzhen127
Copy link
Member

Just curious, has this been a problem before (despite waste of resources)?

@xueweiz
Copy link
Contributor Author

xueweiz commented Jan 6, 2020

No, there is no problem at all. The only thing that I'm concerned with is the waste of resources.

@wangzhen127
Copy link
Member

The ginkgo documentation has said "When possible, you should make every effort to start up a new instance of an external resource for every parallel node. This helps avoid test-pollution by strictly separating each parallel node." If there were no problem before, maybe we should just use the old way?

@xueweiz
Copy link
Contributor Author

xueweiz commented Jan 6, 2020

I agree with the principle that tests should be isolated to avoid pollution.
We already have this isolation by creating individual VM for each test case. Since the behavior we are testing are all guest OS / VM level behaviors, I think GCP project level isolation is an overkill. And I think this PR won't introduce any test pollution.

And although we don't see any resource deficiency at the moment, we have plans to expanding the test infra which will start eating away the Boskos project pool:

  1. We want CI jobs to run on different distros (cos-73-lts, cos-77-lts, ubuntu-1804-lts...). This would causes a 3-5x more resource usage.
  2. We want more behavior and performance tests. And to scale the tests and make them finish in bounded time (~10 minutes), we need to increase the parallelism of the tests.

I agree that this PR is not strictly necessary right now. And I agree that it could potentially add technical maintenance burden (of maintaining this synchronization). However I think the improvement out-weights the cost :)
If you think the maintenance burden may be more significant, I'm happy to hold/cancel this PR for now, and reopen it when we actually sees the resource deficiency :)

@wangzhen127
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wangzhen127, xueweiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [wangzhen127,xueweiz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit aadb2b8 into kubernetes:master Jan 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants