Add gpu cluster upgrade test. #63631
Conversation
@jiayingz: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @vishh @mindprince
test/e2e/upgrades/nvidia-gpu.go
Outdated
t.createdPod = podClient.Create(testPod)
}

// testPod creates a pod that requests gpu and runs a simple cuda job.
The comment is not right!
Done.
Force-pushed from 9c9d6a3 to 6e7b1ee
framework.ExpectNoError(framework.MasterUpgrade(target))
framework.ExpectNoError(framework.CheckMasterVersion(f.ClientSet, target))
}
runUpgradeSuite(f, gpuUpgradeTests, testFrameworks, testSuite, upgCtx, upgrades.ClusterUpgrade, upgradeFunc)
Shouldn't this be upgrades.MasterUpgrade instead of upgrades.ClusterUpgrade?
+1
Done.
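To make the distinction in this thread concrete, here is a minimal sketch of how an upgrade-type enum like the framework's might look. The names mirror the identifiers quoted above, but the definitions here are simplified assumptions, not the real e2e framework code; the point is that a suite whose upgradeFunc only upgrades the master should register as MasterUpgrade, not ClusterUpgrade.

```go
package main

import "fmt"

// UpgradeType is a simplified stand-in for the e2e framework's enum that
// tells each registered test which kind of upgrade is being exercised.
type UpgradeType int

const (
	// MasterUpgrade: only the control plane is upgraded.
	MasterUpgrade UpgradeType = iota
	// NodeUpgrade: only the nodes are upgraded.
	NodeUpgrade
	// ClusterUpgrade: master is upgraded first, then the nodes.
	ClusterUpgrade
)

func (t UpgradeType) String() string {
	switch t {
	case MasterUpgrade:
		return "MasterUpgrade"
	case NodeUpgrade:
		return "NodeUpgrade"
	case ClusterUpgrade:
		return "ClusterUpgrade"
	}
	return "Unknown"
}

func main() {
	// An upgradeFunc that only calls framework.MasterUpgrade should be
	// registered with MasterUpgrade, since no nodes are touched.
	fmt.Println(MasterUpgrade, ClusterUpgrade)
}
```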
// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
	It("should NOT disrrupt gpu pod [Feature:GPUMasterUpgrade]", func() {
Typo: should be disrupt.
Done.
test/e2e/upgrades/nvidia-gpu.go
Outdated
}

// startPod creates a pod that requests gpu and runs a simple cuda job.
func (t *NvidiaGPUUpgradeTest) startPod(f *framework.Framework) {
Why not reuse makeCudaAdditionDevicePluginTestPod?
test/e2e/upgrades/nvidia-gpu.go
Outdated
// Test waits for the upgrade to complete, and then verifies that the
// cuda pod can successfully finish.
func (t *NvidiaGPUUpgradeTest) Test(f *framework.Framework, done <-chan struct{}, upgrade UpgradeType) {
	<-done
In the case of the master-upgrade test, we start the pod before the upgrade (in Setup()), then trigger the upgrade, and then wait for the upgrade to finish and check whether the pod succeeded. Is the assumption here that the upgrade will begin before the pod has completed? Because otherwise the test is not testing anything.
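The lifecycle being debated above can be sketched as follows. This is a toy model with assumed names (upgradeTest, gpuTest), not the framework's real interface: Setup runs before the upgrade begins, Test blocks on the done channel until the upgrade finishes, and only then verifies the workload.

```go
package main

import "fmt"

// upgradeTest models the two-phase contract discussed in the review:
// Setup starts the workload pre-upgrade, Test verifies it post-upgrade.
type upgradeTest interface {
	Setup()                    // start the GPU workload before the upgrade
	Test(done <-chan struct{}) // block until the upgrade completes, then verify
}

type gpuTest struct{ started bool }

func (t *gpuTest) Setup() { t.started = true }

func (t *gpuTest) Test(done <-chan struct{}) {
	<-done // the harness closes done when the upgrade has finished
	if t.started {
		fmt.Println("workload survived the upgrade")
	}
}

func main() {
	done := make(chan struct{})
	var ut upgradeTest = &gpuTest{}
	ut.Setup()
	// Simulate the upgrade running and completing.
	close(done)
	ut.Test(done)
}
```

The reviewer's concern maps directly onto this shape: if the workload can finish between Setup and the start of the upgrade, the post-upgrade check proves nothing about upgrade disruption.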
@mindprince Since pods never run on the master, does it still make sense to test something related to the pod lifecycle in the master-upgrade case?
I think the intention here is to test that the pod is still running even after the master upgrade.
I updated the code to use a Job instead of a Pod to verify that the GPU resource can be consumed correctly in the case of both master upgrade and node upgrade.
Why not run the vector add in a loop and check that there have been no failures? As of now, as @mindprince mentioned, the job could in theory succeed prior to (or during) the master upgrade, which makes the test non-deterministic.
The job is configured to run continuously for 100 completions, which should cover the upgrade.
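A Job along these lines would express that configuration; this is a sketch, not the PR's actual manifest, and the metadata name and container image are assumptions. The key fields are completions: 100, so the CUDA workload keeps re-running across the upgrade window, and the nvidia.com/gpu limit, so each run actually exercises the device plugin.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cuda-add-upgrade-test        # assumed name, for illustration
spec:
  completions: 100                   # run the workload 100 times, spanning the upgrade
  parallelism: 1                     # one pod at a time, sequential completions
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: vector-add
        image: nvidia/samples:vectoradd-cuda10.2   # assumed CUDA sample image
        resources:
          limits:
            nvidia.com/gpu: 1        # request one GPU from the device plugin
```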
FYI, I added a comment to make this clear.
/sig node
test/e2e/upgrades/nvidia-gpu.go
Outdated
scheduling.SetupNVIDIAGPUNode(f, false)
By("Creating a pod requesting gpu")
t.startPod(f)
}
Should verifyPodSuccess be invoked here before returning from Setup()?
// Create the frameworks here because we can only create them
// in a "Describe".
testFrameworks := createUpgradeFrameworks(gpuUpgradeTests)
Describe("master upgrade", func() {
Why is "node upgrade" skipped?
Node upgrade is exercised through cluster upgrade, during which we first upgrade the master to the target version and then upgrade the nodes to the same version. I don't think we have CI tests for node upgrade by itself, because the node version has a dependency on the master version, so I didn't add it here.
Force-pushed from f2f9815 to bb100d7
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-device-plugin-gpu
/approve There are still a couple more comments. Will lgtm once they are resolved.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jiayingz, vishh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/kind feature
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process
/release-note none
/release-note-none
Automatic merge from submit-queue (batch tested with PRs 64344, 64709, 64717, 63631, 58647). If you want to cherry-pick this change to another branch, please follow the instructions here.
…31-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #63631: Add gpu cluster upgrade test. Cherry pick of #63631 on release-1.10. We added a gpu upgrade test config for 1.9-to-1.10 (kubernetes/test-infra#8262). From the test logs, it looks like the cluster upgrade test is using the e2e test version from the latest 1.10 release, which doesn't have the newly added GPUUpgrade test. There doesn't seem to be an easy way to use the e2e test version from a head release while running the upgrade test for an older release version. Cherry-picking the upgrade test to the 1.10 branch so that we can run the gpu upgrade test from 1.9 to 1.10.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
Currently, running the GPUMasterUpgrade test should pass with gpu nodes, but running the GPUClusterUpgrade test will run into #63506
Release note: