Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI #77101

Merged
merged 2 commits into from
May 1, 2019

Conversation

davidz627
Copy link
Contributor

In order to test the CSI Migration feature we implement infrastructure that will gather volume operation metrics at the start and end of each volume test. We will then compare the two and make sure that only CSI Driver metrics are emitted when the in-tree plugin is migrated, or that only in-tree operation metrics are emitted when the in-tree plugin is NOT migrated.

I am currently running all gce pd tests to confirm that this PR is working as expected, will update the PR comments as I get results.

At least for a simple dynamic provisioning exec test [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Dynamic PV (default fs)] volumes should allow exec of files on the volume the implementation will pass for all expected success cases, and fail with reasonable error messages for the expected failure cases.

There is one open issue about the validation of metrics happening before volume unmount and detach. Suggestions welcome but I am actively investigating how to get the validation to happen after volumes are removed from the pod.

/sig storage
/kind feature

/assign @ddebroy @leakingtapan @msau42 @jsafrane @saad-ali

NONE

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 25, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 25, 2019
@davidz627
Copy link
Contributor Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 25, 2019
@davidz627 davidz627 changed the title Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI WIP Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI Apr 25, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2019
@@ -128,7 +128,10 @@ const (

// DriverInfo represents static information about a TestDriver.
type DriverInfo struct {
Name string // Name of the driver, aka the provisioner name.
Name string // Internal name of the driver, this is NOT the provisioner name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just say this is a display name for the driver to use in the test case name and test objects

Name string // Internal name of the driver, this is NOT the provisioner name
// Name of the in-tree plugin if it exists
// empty if this DriverInfo represents a CSI Driver
PluginName string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify that this is the fully qualified plugin name as registered in Kubernetes

return totOps
}

func getVolumeOpsFromKubeletMetricsForPlugin(ms metrics.KubeletMetrics, pluginName string) opCounts {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two methods look exactly the same, can we merge them (or part of them) somehow?

continue
}
opName := string(sample.Metric["operation_name"])
if opName == "verify_controller_attached_volume" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this emitted for controller metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but since i've merged the two metrics functions I skip verify_controller_attached_volume for both controller and Kubelet metrics. I see no issue with this

for op, count := range o1 {
seen.Insert(op)
totOps[op] = totOps[op] + count
totOps[op] = totOps[op] + o2[op]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o2[op] can go in the line before?

Copy link
Contributor Author

@davidz627 davidz627 Apr 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

brain fart lol. for some reason I thought I could only add exactly two numbers together

}

func validateMigrationVolumeOpCounts(cs clientset.Interface, pluginName string, oldInTreeOps, oldMigratedOps opCounts) {
if len(pluginName) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call the field in the driver config InTreePluginName then if you don't want it to be used for csi drivers


newInTreeOps, newMigratedOps := getMigrationVolumeOpCounts(cs, pluginName)

if sets.NewString(strings.Split(*migratedPlugins, ",")...).Has(pluginName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check this first? It saves querying the components for the metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if this condition is not true, we want to get the newInTreeOps. However, with #77101 (comment) this may no longer be true


nodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{})
framework.ExpectNoError(err, "Error listing nodes: %v", err)
for _, node := range nodes.Items {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention that this validation will be checked by default when migration is on by default?

Iterating over all nodes may not work so well for scale testing clusters that are 1000s of nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I believe most tests specify a "ClientNodeName" which we can use to only query the specific node the volume/pod is scheduled on. However for tests that don't specify what do we want the behavior to be?

Maybe we check how many nodes there are and above a certain threshhold we just decide to not check node metrics? That seems pretty hacky to me, however it still achieves the goal of validating migration functionality when we run the tests on small-ish clusters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a 10 node threshhold. However, I've noticed that only certain drivers set a nodeName (local, hostpath, iscsi) while the other drivers do not. I wonder if it is maybe worth having each of the drivers set a nodeName for their tests (which is kind of weird for multi-node tests). However, this seems like an issue with the test infra itself as I believe nodeName should probably be a per-test level field, not a per-driver field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clientNodeName essentially bypasses the scheduler and forces the pod to go to a specific node. It should only be used for hostpath. I'm not sure why local needs it since local supports topology cc @cofyc

Anyway, we should not use node name, especially as we add more topology and multi-node, multi-writer tests in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case there is no real generic way (that I can think of) to scrape the metrics from the "correct" node that the test is running on and I'm thinking custom logic for each test doesn't seem great. It would be nice if the operation metric also emitted the node that it was done on, then we could pull which node it got attached to from the controller metric but that increases the carnality of the metric and is only useful for this specific functionality.

Maybe we only look at the controller metrics unless the cluster is "small enough"? That is far from ideal, but it gives us some signal to whether the feature is working or not.. In the near future we were thinking of "disabling" the entire in-tree plugin if the migration feature for that plugin is on anyway, so that combined with some negative metrics checking, combined with tests passing, is a pretty strong signal. Even if the metrics infra part is not the strongest signal on its own

}
totMigrated := getTotOps(newMigratedOps)
oldTotMigrated := getTotOps(oldMigratedOps)
if totMigrated-oldTotMigrated <= 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some negative test cases may not cause the operation metrics to be incremented

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point. We could loosen the guarantees and only check:
if migrated -> no in-tree operations are done
if not migrated -> no migrated CSI Driver ops are done (we currently can't do this unless we add a "migrated" dimension to the metrics).

For the migrated case we at least know:

  1. No in-tree operations were done in any test
  2. The test passes so the e2e things we needed to do were achieved
    these seem sufficient to determine that migration is "working". To determine that we're not "accidentally" using migration it is suffiecient to run all the tests with migration off without the driver installed.

So maybe this is enough. lmk what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this is ok.

if sets.NewString(strings.Split(*migratedPlugins, ",")...).Has(pluginName) {
// If this plugin is migrated based on the test flag storage.migratedPlugins
for op, count := range newInTreeOps {
if count != oldInTreeOps[op] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be 0? Why would the intree metric be emitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, they can both be 0. The in-tree metric would be emitted if CSI Migration was not working and the operation was actually done with the in-tree plugin

@davidz627
Copy link
Contributor Author

@msau42 I've made the recommended changes/fixes. However, there are still some open questions. Please let me know what you think.

@davidz627 davidz627 changed the title WIP Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI Apr 29, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2019
// useCSIPlugin will check both CSIMigration and the plugin specific feature gate
if useCSIPlugin(og.volumePluginMgr, volumeToAttach.VolumeSpec) && nu {
// useCSIPlugin will check both CSIMigration and the plugin specific feature gates
ucp := useCSIPlugin(og.volumePluginMgr, volumeToAttach.VolumeSpec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to define ucp? It's only used once in this function?

volumePlugin, err :=
og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)
if err == nil && volumePlugin != nil {
volumePluginName = volumePlugin.GetPluginName()
}

mountVolumeFunc := func() (error, error) {
originalSpec := volumeToMount.VolumeSpec

// Get mounter plugin
if useCSIPlugin(og.volumePluginMgr, volumeToMount.VolumeSpec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to translate the spec here too since it's done earlier now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safer to as the one done earlier just swallows any errors it encounters. This issue can be resolved along with this: https://github.com/kubernetes/kubernetes/pull/77101/files#diff-450e811a4953f760ff1594ede8b2037eR388

I think the idea would be to determine which plugin to emit the metric for when the operation is being run (since this now actually depends on some runtime state), instead of prior in the operation generator func.

@@ -131,6 +136,10 @@ func (t *volumesTestSuite) defineTests(driver TestDriver, pattern testpatterns.T
l.testCleanup()
l.testCleanup = nil
}

// TODO(dyzz): At this point the volume still has not unmounted or detached...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test cleanup checks that the Pod object is fully deleted, then that means unmount has been successful.

Also if cleanupResource() also handles deleting the PVC and waiting for the PV to disappear, then that means it needs to have been detached and deleted too.

@msau42
Copy link
Member

msau42 commented May 1, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 1, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 1, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 1, 2019
@davidz627
Copy link
Contributor Author

@msau42 removed that TODOm cleaned up commits, fixed gofmt issue, should be ready for re-LGTM

@msau42
Copy link
Member

msau42 commented May 1, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 1, 2019
@k8s-ci-robot k8s-ci-robot merged commit 2b56145 into kubernetes:master May 1, 2019
@davidz627 davidz627 deleted the feature/migrationTest branch May 1, 2019 22:52
@Random-Liu
Copy link
Member

It seems that this broke containerd test.

https://k8s-testgrid.appspot.com/sig-node-containerd#e2e-gci

/cc @davidz627 @msau42

@k8s-ci-robot k8s-ci-robot requested a review from msau42 May 2, 2019 08:08
@k8s-ci-robot
Copy link
Contributor

@Random-Liu: GitHub didn't allow me to request PR reviews from the following users: davidz627.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

It seems that this broke containerd test.

https://k8s-testgrid.appspot.com/sig-node-containerd#e2e-gci

/cc @davidz627 @msau42

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants