
Add gpu cluster upgrade test. #63631

Merged
merged 1 commit on Jun 5, 2018

Conversation

jiayingz
Contributor

@jiayingz jiayingz commented May 9, 2018

What this PR does / why we need it:

Special notes for your reviewer:
Currently, running the GPUMasterUpgrade test should pass with GPU nodes, but running the GPUClusterUpgrade test will run into #63506

Release note:


@k8s-ci-robot
Contributor

@jiayingz: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 9, 2018
@k8s-ci-robot k8s-ci-robot requested review from enisoc and enj May 9, 2018 23:42
@jiayingz
Contributor Author

jiayingz commented May 9, 2018

/cc @vishh @mindprince

t.createdPod = podClient.Create(testPod)
}

// testPod creates a pod that requests gpu and runs a simple cuda job.
Contributor

The comment is not right!

Contributor Author

Done.

framework.ExpectNoError(framework.MasterUpgrade(target))
framework.ExpectNoError(framework.CheckMasterVersion(f.ClientSet, target))
}
runUpgradeSuite(f, gpuUpgradeTests, testFrameworks, testSuite, upgCtx, upgrades.ClusterUpgrade, upgradeFunc)
Member

Shouldn't this be upgrades.MasterUpgrade instead of upgrades.ClusterUpgrade?

Contributor

+1

Contributor Author

Done.

// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
It("should NOT disrrupt gpu pod [Feature:GPUMasterUpgrade]", func() {
Member

Typo: should be disrupt.

Contributor Author

Done.

}

// startPod creates a pod that requests gpu and runs a simple cuda job.
func (t *NvidiaGPUUpgradeTest) startPod(f *framework.Framework) {
Member

Why not reuse makeCudaAdditionDevicePluginTestPod?

// Test waits for the upgrade to complete, and then verifies that the
// cuda pod can successfully finish.
func (t *NvidiaGPUUpgradeTest) Test(f *framework.Framework, done <-chan struct{}, upgrade UpgradeType) {
<-done
Member

In the case of the master-upgrade test, we start the pod before the upgrade (in Setup()), then trigger the upgrade, then wait for the upgrade to finish and check whether the pod succeeded or not.

Is the assumption here that the upgrade will begin before the pod has completed? Because otherwise the test is not testing anything.
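The lifecycle described here can be sketched with simplified stand-in types (a hypothetical fakeUpgradeTest, not the real framework types): Setup starts the workload before the upgrade, the harness performs the upgrade and closes the done channel, and Test blocks on done before verifying anything.

```go
package main

import "fmt"

// fakeUpgradeTest is a hypothetical, simplified stand-in for an
// upgrade test: it only records that the workload was started and
// later verified. The real test would create a CUDA pod/job and
// check its status against the cluster.
type fakeUpgradeTest struct {
	podStarted bool
	verified   bool
}

// Setup runs before the upgrade begins; the real test creates the
// GPU workload here.
func (t *fakeUpgradeTest) Setup() {
	t.podStarted = true
}

// Test blocks until the upgrade has finished (done is closed), and
// only then verifies the workload.
func (t *fakeUpgradeTest) Test(done <-chan struct{}) {
	<-done // wait for the (simulated) upgrade to complete
	if t.podStarted {
		t.verified = true // real test: check the pod/job succeeded
	}
}

func main() {
	t := &fakeUpgradeTest{}
	t.Setup()
	done := make(chan struct{})
	go func() { close(done) }() // simulate the upgrade completing
	t.Test(done)
	fmt.Println(t.verified)
}
```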

Contributor

@vikaschoudhary16 vikaschoudhary16 May 21, 2018

@mindprince Since the pod will never run on the master, does it still make sense to test something related to the pod lifecycle in the master-upgrade case?
I think the intention here is to test that the pod is still running even after the master upgrade.

Contributor Author

I updated the code to use Job instead of Pod to verify GPU resource can be consumed correctly in case of both master upgrade and node upgrade.

Contributor

Why not run the vector add in a loop and check that there have been no failures? As of now as @mindprince mentioned, the job could in theory succeed prior to (or during) master upgrade which makes the test non-deterministic.

Contributor Author

The job is configured to run continuously for 100 completions, which should cover the upgrade window.
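For illustration, a Job of the shape described here might look like the following sketch; the name, image, and field values are assumptions for this example, not taken from the PR:

```yaml
# Hypothetical sketch: a Job that repeatedly runs a CUDA vector-add
# workload on a GPU node so that runs span the upgrade window.
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-add          # illustrative name
spec:
  completions: 100        # keep running across the whole upgrade
  parallelism: 1          # one pod at a time
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vector-add
        image: nvidia/samples:vectoradd-cuda10.2   # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 1   # request one GPU per pod
```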

Contributor Author

FYI, I added some comments to make this clear.

@vikaschoudhary16
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 16, 2018
scheduling.SetupNVIDIAGPUNode(f, false)
By("Creating a pod requesting gpu")
t.startPod(f)
}
Contributor

Should verifyPodSuccess be invoked here before returning from setup()?

// Create the frameworks here because we can only create them
// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
Contributor

Why "node upgrade" is skipped?

Contributor Author

Node upgrade is exercised through cluster upgrade, during which we first upgrade the master to the target version and then upgrade the nodes to the same version. I don't think we have CI tests for node upgrade by itself because the node version depends on the master version, so I didn't add it here.

@jiayingz jiayingz force-pushed the upgrade-test branch 2 times, most recently from f2f9815 to bb100d7, on May 31, 2018 at 05:53
@jiayingz
Contributor Author

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-kubemark-e2e-gce-big

@jiayingz
Contributor Author

jiayingz commented Jun 1, 2018

/test pull-kubernetes-e2e-gce-device-plugin-gpu
/test pull-kubernetes-e2e-gce

@vishh
Contributor

vishh commented Jun 4, 2018

/approve

There are still a couple more comments. Will lgtm once they are resolved.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2018
Contributor

@vishh vishh left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jiayingz
Contributor Author

jiayingz commented Jun 5, 2018

/kind feature
/priority important-soon

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 5, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jiayingz @vishh

Pull Request Labels
  • sig/node: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.

@jiayingz
Contributor Author

jiayingz commented Jun 5, 2018

/release-note none

@jiayingz
Contributor Author

jiayingz commented Jun 5, 2018

/release-note-none

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Jun 5, 2018
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 0bd77a2 into kubernetes:master Jun 5, 2018
k8s-github-robot pushed a commit that referenced this pull request Jun 11, 2018
…31-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #63631: Add gpu cluster upgrade test.

Cherry pick of #63631 on release-1.10.

#63631: Add gpu cluster upgrade test.

We added a gpu upgrade test config from 1.9 to 1.10 (kubernetes/test-infra#8262). From the test logs, it looks like the cluster upgrade test uses the e2e test version from the latest 1.10 release, which doesn't have the newly added GPUUpgrade test. There doesn't seem to be an easy way to use the e2e test version from a head release while running the upgrade test for an older release version. Cherry-picking the upgrade test to the 1.10 branch so that we can run the gpu upgrade test from 1.9 to 1.10.