Add testing infra for checking whether an in-tree plugin is using backend that is shimmed to CSI #77101
Conversation
/priority important-soon
@@ -128,7 +128,10 @@ const (

 // DriverInfo represents static information about a TestDriver.
 type DriverInfo struct {
-	Name string // Name of the driver, aka the provisioner name.
+	Name string // Internal name of the driver, this is NOT the provisioner name
maybe just say this is a display name for the driver to use in the test case name and test objects
Name string // Internal name of the driver, this is NOT the provisioner name
// Name of the in-tree plugin if it exists
// empty if this DriverInfo represents a CSI Driver
PluginName string
Clarify that this is the fully qualified plugin name as registered in Kubernetes
test/e2e/storage/testsuites/base.go
Outdated
	return totOps
}

func getVolumeOpsFromKubeletMetricsForPlugin(ms metrics.KubeletMetrics, pluginName string) opCounts { |
The two methods look exactly the same, can we merge them (or part of them) somehow?
test/e2e/storage/testsuites/base.go
Outdated
	continue
}
opName := string(sample.Metric["operation_name"])
if opName == "verify_controller_attached_volume" { |
is this emitted for controller metrics?
No, but since I've merged the two metrics functions I skip verify_controller_attached_volume for both controller and kubelet metrics. I see no issue with this.
test/e2e/storage/testsuites/base.go
Outdated
for op, count := range o1 {
	seen.Insert(op)
	totOps[op] = totOps[op] + count
totOps[op] = totOps[op] + o2[op] |
o2[op] can go in the line before?
brain fart lol. for some reason I thought I could only add exactly two numbers together
}

func validateMigrationVolumeOpCounts(cs clientset.Interface, pluginName string, oldInTreeOps, oldMigratedOps opCounts) {
if len(pluginName) == 0 { |
Maybe call the field in the driver config InTreePluginName then if you don't want it to be used for csi drivers
newInTreeOps, newMigratedOps := getMigrationVolumeOpCounts(cs, pluginName)

if sets.NewString(strings.Split(*migratedPlugins, ",")...).Has(pluginName) { |
Should we check this first? It saves querying the components for the metrics
Even if this condition is not true, we want to get the newInTreeOps. However, with #77101 (comment) this may no longer be true.
test/e2e/storage/testsuites/base.go
Outdated
nodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{})
framework.ExpectNoError(err, "Error listing nodes: %v", err)
for _, node := range nodes.Items { |
Is the intention that this validation will be checked by default when migration is on by default?
Iterating over all nodes may not work so well for scale testing clusters that are 1000s of nodes.
Good catch! I believe most tests specify a "ClientNodeName" which we can use to only query the specific node the volume/pod is scheduled on. However, for tests that don't specify one, what do we want the behavior to be?
Maybe we check how many nodes there are and above a certain threshold we just decide not to check node metrics? That seems pretty hacky to me, but it still achieves the goal of validating migration functionality when we run the tests on small-ish clusters.
I've made a 10-node threshold. However, I've noticed that only certain drivers set a nodeName (local, hostpath, iscsi) while the other drivers do not. I wonder if it is worth having each of the drivers set a nodeName for their tests (which is kind of weird for multi-node tests). However, this seems like an issue with the test infra itself, as I believe nodeName should probably be a per-test level field, not a per-driver field.
clientNodeName essentially bypasses the scheduler and forces the pod to go to a specific node. It should only be used for hostpath. I'm not sure why local needs it since local supports topology cc @cofyc
Anyway, we should not use node name, especially as we add more topology and multi-node, multi-writer tests in the future.
In that case there is no real generic way (that I can think of) to scrape the metrics from the "correct" node that the test is running on, and custom logic for each test doesn't seem great. It would be nice if the operation metric also emitted the node it was done on; then we could pull which node the volume got attached to from the controller metric, but that increases the cardinality of the metric and is only useful for this specific functionality.
Maybe we only look at the controller metrics unless the cluster is "small enough"? That is far from ideal, but it gives us some signal as to whether the feature is working or not. In the near future we were thinking of "disabling" the entire in-tree plugin if the migration feature for that plugin is on anyway, so that combined with some negative metrics checking, combined with tests passing, is a pretty strong signal, even if the metrics infra part is not the strongest signal on its own.
test/e2e/storage/testsuites/base.go
Outdated
}
totMigrated := getTotOps(newMigratedOps)
oldTotMigrated := getTotOps(oldMigratedOps)
if totMigrated-oldTotMigrated <= 0 { |
Some negative test cases may not cause the operation metrics to be incremented
this is a good point. We could loosen the guarantees and only check:
if migrated -> no in-tree operations are done
if not migrated -> no migrated CSI Driver ops are done (we currently can't do this unless we add a "migrated" dimension to the metrics).
For the migrated case we at least know:
- No in-tree operations were done in any test
- The test passes so the e2e things we needed to do were achieved
These seem sufficient to determine that migration is "working". To determine that we're not "accidentally" using migration, it is sufficient to run all the tests with migration off without the driver installed.
So maybe this is enough. lmk what you think
Yeah I think this is ok.
if sets.NewString(strings.Split(*migratedPlugins, ",")...).Has(pluginName) {
	// If this plugin is migrated based on the test flag storage.migratedPlugins
	for op, count := range newInTreeOps {
if count != oldInTreeOps[op] { |
Can this be 0? Why would the intree metric be emitted?
yes, they can both be 0. The in-tree metric would be emitted if CSI Migration was not working and the operation was actually done with the in-tree plugin
@msau42 I've made the recommended changes/fixes. However, there are still some open questions. Please let me know what you think.
Force-pushed from 83d89a0 to 385d573 (Compare)
-	// useCSIPlugin will check both CSIMigration and the plugin specific feature gate
-	if useCSIPlugin(og.volumePluginMgr, volumeToAttach.VolumeSpec) && nu {
+	// useCSIPlugin will check both CSIMigration and the plugin specific feature gates
+	ucp := useCSIPlugin(og.volumePluginMgr, volumeToAttach.VolumeSpec)
do you need to define ucp? It's only used once in this function?
volumePlugin, err :=
	og.volumePluginMgr.FindPluginBySpec(volumeToMount.VolumeSpec)
if err == nil && volumePlugin != nil {
	volumePluginName = volumePlugin.GetPluginName()
}

mountVolumeFunc := func() (error, error) {
	originalSpec := volumeToMount.VolumeSpec

	// Get mounter plugin
if useCSIPlugin(og.volumePluginMgr, volumeToMount.VolumeSpec) { |
Do we still need to translate the spec here too since it's done earlier now?
I think it is safer to as the one done earlier just swallows any errors it encounters. This issue can be resolved along with this: https://github.com/kubernetes/kubernetes/pull/77101/files#diff-450e811a4953f760ff1594ede8b2037eR388
I think the idea would be to determine which plugin to emit the metric for when the operation is being run (since this now actually depends on some runtime state), instead of prior in the operation generator func.
Force-pushed from 385d573 to a1f5d27 (Compare)
@@ -131,6 +136,10 @@ func (t *volumesTestSuite) defineTests(driver TestDriver, pattern testpatterns.T
	l.testCleanup()
	l.testCleanup = nil
}

// TODO(dyzz): At this point the volume still has not unmounted or detached... |
If the test cleanup checks that the Pod object is fully deleted, then that means unmount has been successful.
Also if cleanupResource() also handles deleting the PVC and waiting for the PV to disappear, then that means it needs to have been detached and deleted too.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: davidz627, msau42. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Force-pushed from a1f5d27 to 1271237 (Compare)
@msau42 removed that TODO, cleaned up commits, and fixed the gofmt issue; should be ready for re-LGTM.
/lgtm
It seems that this broke the containerd test: https://k8s-testgrid.appspot.com/sig-node-containerd#e2e-gci /cc @davidz627 @msau42
@Random-Liu: GitHub didn't allow me to request PR reviews from the following users: davidz627. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In order to test the CSI Migration feature we implement infrastructure that will gather volume operation metrics at the start and end of each volume test. We will then compare the two and make sure that only CSI Driver metrics are emitted when the in-tree plugin is migrated, or that only in-tree operation metrics are emitted when the in-tree plugin is NOT migrated.
I am currently running all gce pd tests to confirm that this PR is working as expected, and will update the PR comments as I get results. At least for a simple dynamic provisioning exec test, [sig-storage] In-tree Volumes [Driver: gcepd] [Testpattern: Dynamic PV (default fs)] volumes should allow exec of files on the volume, the implementation will pass for all expected success cases, and fail with reasonable error messages for the expected failure cases.

There is one open issue about the validation of metrics happening before volume unmount and detach. Suggestions welcome, but I am actively investigating how to get the validation to happen after volumes are removed from the pod.

/sig storage
/kind feature
/assign @ddebroy @leakingtapan @msau42 @jsafrane @saad-ali