Add gpu cluster upgrade test. #63631
Conversation
@jiayingz: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @vishh @mindprince
test/e2e/upgrades/nvidia-gpu.go
Outdated
t.createdPod = podClient.Create(testPod)
}

// testPod creates a pod that requests gpu and runs a simple cuda job.
The comment is not right!
Done.
Force-pushed from 9c9d6a3 to 6e7b1ee
framework.ExpectNoError(framework.MasterUpgrade(target))
framework.ExpectNoError(framework.CheckMasterVersion(f.ClientSet, target))
}
runUpgradeSuite(f, gpuUpgradeTests, testFrameworks, testSuite, upgCtx, upgrades.ClusterUpgrade, upgradeFunc)
Shouldn't this be upgrades.MasterUpgrade instead of upgrades.ClusterUpgrade?
+1
Done.
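To make the distinction in this thread concrete, here is a minimal sketch of how an upgrade-type enum like the framework's might look. The names mirror the identifiers quoted above, but the definitions here are simplified assumptions, not the real e2e framework code; the point is that a suite whose upgradeFunc only upgrades the master should register as MasterUpgrade, not ClusterUpgrade.

```go
package main

import "fmt"

// UpgradeType is a simplified stand-in for the e2e framework's enum that
// tells each registered test which kind of upgrade is being exercised.
type UpgradeType int

const (
	// MasterUpgrade: only the control plane is upgraded.
	MasterUpgrade UpgradeType = iota
	// NodeUpgrade: only the nodes are upgraded.
	NodeUpgrade
	// ClusterUpgrade: master is upgraded first, then the nodes.
	ClusterUpgrade
)

func (t UpgradeType) String() string {
	switch t {
	case MasterUpgrade:
		return "MasterUpgrade"
	case NodeUpgrade:
		return "NodeUpgrade"
	case ClusterUpgrade:
		return "ClusterUpgrade"
	}
	return "Unknown"
}

func main() {
	// An upgradeFunc that only calls framework.MasterUpgrade should be
	// registered with MasterUpgrade, since no nodes are touched.
	fmt.Println(MasterUpgrade, ClusterUpgrade)
}
```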
// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
	It("should NOT disrrupt gpu pod [Feature:GPUMasterUpgrade]", func() {
Typo: should be disrupt.
Done.
test/e2e/upgrades/nvidia-gpu.go
Outdated
}

// startPod creates a pod that requests gpu and runs a simple cuda job.
func (t *NvidiaGPUUpgradeTest) startPod(f *framework.Framework) {
Why not reuse makeCudaAdditionDevicePluginTestPod?
test/e2e/upgrades/nvidia-gpu.go
Outdated
// Test waits for the upgrade to complete, and then verifies that the
// cuda pod can successfully finish.
func (t *NvidiaGPUUpgradeTest) Test(f *framework.Framework, done <-chan struct{}, upgrade UpgradeType) {
	<-done
In the case of the master-upgrade test, we start the pod before the upgrade (in Setup()), then trigger the upgrade, and then wait for the upgrade to finish and check whether the pod succeeded. Is the assumption here that the upgrade will begin before the pod has completed? Because otherwise the test is not testing anything.
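The lifecycle being debated above can be sketched as follows. This is a toy model with assumed names (upgradeTest, gpuTest), not the framework's real interface: Setup runs before the upgrade begins, Test blocks on the done channel until the upgrade finishes, and only then verifies the workload.

```go
package main

import "fmt"

// upgradeTest models the two-phase contract discussed in the review:
// Setup starts the workload pre-upgrade, Test verifies it post-upgrade.
type upgradeTest interface {
	Setup()                    // start the GPU workload before the upgrade
	Test(done <-chan struct{}) // block until the upgrade completes, then verify
}

type gpuTest struct{ started bool }

func (t *gpuTest) Setup() { t.started = true }

func (t *gpuTest) Test(done <-chan struct{}) {
	<-done // the harness closes done when the upgrade has finished
	if t.started {
		fmt.Println("workload survived the upgrade")
	}
}

func main() {
	done := make(chan struct{})
	var ut upgradeTest = &gpuTest{}
	ut.Setup()
	// Simulate the upgrade running and completing.
	close(done)
	ut.Test(done)
}
```

The reviewer's concern maps directly onto this shape: if the workload can finish between Setup and the start of the upgrade, the post-upgrade check proves nothing about upgrade disruption.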
@mindprince Since pods never run on the master, does it still make sense to test something related to the pod lifecycle in the master-upgrade case?
I think the intention here is to test that the pod is still running even after the master upgrade.
I updated the code to use a Job instead of a Pod to verify that the GPU resource can be consumed correctly in the case of both master upgrade and node upgrade.
Why not run the vector add in a loop and check that there have been no failures? As of now, as @mindprince mentioned, the job could in theory succeed prior to (or during) the master upgrade, which makes the test non-deterministic.
The job is configured to run continuously for 100 completions, which should cover the upgrade.
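A Job along these lines would express that configuration; this is a sketch, not the PR's actual manifest, and the metadata name and container image are assumptions. The key fields are completions: 100, so the CUDA workload keeps re-running across the upgrade window, and the nvidia.com/gpu limit, so each run actually exercises the device plugin.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-add-upgrade-test        # assumed name, for illustration
spec:
  completions: 100                   # run the workload 100 times, spanning the upgrade
  parallelism: 1                     # one pod at a time, sequential completions
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: vector-add
        image: nvidia/samples:vectoradd-cuda10.2   # assumed CUDA sample image
        resources:
          limits:
            nvidia.com/gpu: 1        # request one GPU from the device plugin
```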
FYI, I added a comment to make this clear.
/sig node
test/e2e/upgrades/nvidia-gpu.go
Outdated
scheduling.SetupNVIDIAGPUNode(f, false)
By("Creating a pod requesting gpu")
t.startPod(f)
}
Should verifyPodSuccess be invoked here before returning from Setup()?
// Create the frameworks here because we can only create them
// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
Why is "node upgrade" skipped?
Node upgrade is exercised through cluster upgrade, during which we first upgrade the master to the target version and then upgrade the nodes to the same version. I don't think we have CI tests for node upgrade by itself, because the node version has a dependency on the master version, so I didn't add it here.
Force-pushed from f2f9815 to bb100d7
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-device-plugin-gpu
/approve There are still a couple more comments. Will lgtm once they are resolved.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jiayingz, vishh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/kind feature
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process
/release-note none
/release-note-none
Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here.
…31-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #63631: Add gpu cluster upgrade test. Cherry pick of #63631 on release-1.10. We added a gpu upgrade test config for 1.9-to-1.10 (kubernetes/test-infra#8262). From the test logs, it looks like the cluster upgrade test is using the e2e test version from the latest 1.10 release, which doesn't have the newly added GPUUpgrade test. There doesn't seem to be an easy way to use the e2e test version from a head release while running the upgrade test for an older release version. Cherry-picking the upgrade test to the 1.10 branch so that we can run the gpu upgrade test from 1.9 to 1.10.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
Currently, running the GPUMasterUpgrade test should pass with gpu nodes, but running the GPUClusterUpgrade test will run into #63506
Release note: