
STOR-1803: add vsphere snapshot configuration fields to ClusterCSIDriver #1783

Merged: 2 commits, Apr 15, 2024

Conversation

RomanBednar
Contributor

@RomanBednar RomanBednar commented Mar 6, 2024

cc @openshift/storage

Enhancement: openshift/enhancements#1563

@openshift-ci-robot

openshift-ci-robot commented Mar 6, 2024

@RomanBednar: This pull request references STOR-1803 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

cc @openshift/storage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 6, 2024
Contributor

openshift-ci bot commented Mar 6, 2024

Hello @RomanBednar! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 6, 2024
@openshift-ci-robot

openshift-ci-robot commented Mar 6, 2024

@RomanBednar: This pull request references STOR-1803 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

cc @openshift/storage

Enhancement: openshift/enhancements#1563

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 7, 2024
@RomanBednar
Contributor Author

The verify-crd-schema error is probably not related to this change, as I have not changed the topologyCategories field: spec.driverConfig.vSphere.topologyCategories must set x-kubernetes-list-type - should it be overridden?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2024
@RomanBednar RomanBednar changed the title STOR-1803: add vsphere snapshot configuration fields to ClusterCSIDriver WIP: STOR-1803: add vsphere snapshot configuration fields to ClusterCSIDriver Mar 11, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2024
// +kubebuilder:validation:Optional
// +openshift:enable:FeatureSets=TechPreviewNoUpgrade
// +optional
GlobalMaxSnapshotsPerBlockVolume *uint32 `json:"globalMaxSnapshotsPerBlockVolume,omitempty"`
Contributor

@jsafrane jsafrane Mar 12, 2024


For all new fields:

  • There should be some validation, for example the value should not be negative. Or semantics of negative values should be described.
  • Are there any max values?
  • Value 0 should have some description - is it a valid value? Does it turn off snapshot support for VSAN / VVOL / globally?
  • What are the impacts of setting the values high? I would expect higher values mean less performance, otherwise we can set the values to MAXINT by default and we don't need any configuration.
  • It could help if there was link to vSphere docs, the questions could be answered there.

Contributor Author


There should be some validation, for example the value should not be negative. Or semantics of negative values should be described.

  • Correct - currently the operator would fail to unmarshal, as the field is of type uint32
  • Adding validation for min/max volumes.
W0312 17:09:26.526662    8935 reflector.go:539] k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: failed to list *v1.ClusterCSIDriver: json: cannot unmarshal number -1 into Go struct field VSphereCSIDriverConfigSpec.items.spec.driverConfig.vSphere.globalMaxSnapshotsPerBlockVolume of type uint32
E0312 17:09:26.526720    8935 reflector.go:147] k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: Failed to watch *v1.ClusterCSIDriver: failed to list *v1.ClusterCSIDriver: json: cannot unmarshal number -1 into Go struct field VSphereCSIDriverConfigSpec.items.spec.driverConfig.vSphere.globalMaxSnapshotsPerBlockVolume of type uint32

Are there any max values?

  • From the VMware documentation: a maximum of 32 snapshots is supported in a chain; however, for better performance use only 2 to 3 snapshots.
  • Adding validation for max value of 32.
  • Setting any higher value in the config is not prevented currently:
[Snapshot]
global-max-snapshots-per-block-volume = 33
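A sketch of what the annotated field looks like with the min/max validation discussed here (the surrounding type is trimmed for illustration; the kubebuilder markers are consumed at CRD-generation time, not at runtime):

```go
package main

import "fmt"

// Sketch of the field with the validation being added; the kubebuilder
// markers below are what generate the CRD's minimum/maximum constraints.
type VSphereCSIDriverConfigSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=32
	// +optional
	GlobalMaxSnapshotsPerBlockVolume *uint32 `json:"globalMaxSnapshotsPerBlockVolume,omitempty"`
}

func main() {
	v := uint32(3)
	spec := VSphereCSIDriverConfigSpec{GlobalMaxSnapshotsPerBlockVolume: &v}
	fmt.Println(*spec.GlobalMaxSnapshotsPerBlockVolume) // prints: 3
}
```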

Value 0 should have some description - is it a valid value? Does it turn off snapshot support for VSAN / VVOL / globally?

  • Not documented at all; testing suggests that a value of 0 effectively unsets the option, so the default is used instead.
  • Snapshots are not disabled when 0 is set.
  • Setting the minimum value to 1 should work fine, as there is no reason to set 0.
$ oc -n openshift-cluster-csi-drivers get cm/vsphere-csi-config -o yaml | grep global-max-snapshots-per-block-volume
    global-max-snapshots-per-block-volume = 0
message: 'Failed to create snapshot: failed to take snapshot of the volume c0e2dc41-33de-4566-9323-9aa422741493: "rpc error: code = FailedPrecondition desc = the number of snapshots on the source volume c0e2dc41-33de-4566-9323-9aa422741493 reaches the configured maximum (3)"'

What are the impacts of setting the values high? I would expect higher values mean less performance, otherwise we can set the values to MAXINT by default and we don't need any configuration.

  • This is documented under "Best practices for snapshots": https://kb.vmware.com/s/article/1025279
  • 2-3 snapshots are recommended for better performance, so we can expect a performance drop when using more than 3 snapshots.

It could help if there was link to vSphere docs, the questions could be answered there.

Contributor


Thanks for the update. I think linking the vSphere docs will help users actually understand what the options are for, so they don't need to guess.

Comment on lines 293 to 294
// overrides the global constraint if set, while it falls back to the global constraint if unset.
// +kubebuilder:validation:Optional
Contributor


IMO it would be better to say "overrides globalMaxSnapshotsPerBlockVolume" instead of "the global constraint".

Contributor Author


Ack - changed.

@RomanBednar RomanBednar force-pushed the driver-config branch 2 times, most recently from 204a82d to e8fa952 Compare March 12, 2024 14:59
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 12, 2024
@RomanBednar RomanBednar force-pushed the driver-config branch 2 times, most recently from f9ce480 to 7ea864d Compare March 13, 2024 13:33
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@jsafrane
Contributor

The API looks good to me, you probably need to fix the tests somehow.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2024
@RomanBednar
Contributor Author

Yeah, there are some changes in progress so that new API fields are enabled by feature gate rather than by feature set; documentation should arrive soon. From current observations it seems there should now be three separate files for the clustercsidriver CRD, which means we have to change the CSO image a bit.

Then verification requires a test file for each of those, so I've added that, with a small test case for TechPreviewNoUpgrade covering the new fields - ci/prow/verify seems happy now.

The verify-crd-schema job seems to fail due to topologyCategories, which is not in scope of this change and probably should be overridden:

	could not run schemacheck generator for group/version operator.openshift.io/v1: 
		error in 0000_90_cluster_csi_driver_01_config-CustomNoUpgrade.crd.yaml: ListsMustHaveSSATags: crd/clustercsidrivers.operator.openshift.io version/v1 field/^.spec.driverConfig.vSphere.topologyCategories must set x-kubernetes-list-type
		error in 0000_90_cluster_csi_driver_01_config-Default.crd.yaml: ListsMustHaveSSATags: crd/clustercsidrivers.operator.openshift.io version/v1 field/^.spec.driverConfig.vSphere.topologyCategories must set x-kubernetes-list-type
		error in 0000_90_cluster_csi_driver_01_config-TechPreviewNoUpgrade.crd.yaml: ListsMustHaveSSATags: crd/clustercsidrivers.operator.openshift.io version/v1 field/^.spec.driverConfig.vSphere.topologyCategories must set x-kubernetes-list-type
+ echo This verifier checks all files that have changed. In some cases you may have changed or renamed a file that already contained api violations, but you are not introducing a new violation. In such cases it is appropriate to /override the failing CI job. 
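For context, the ListsMustHaveSSATags check expects list fields in the generated CRD schema to declare a server-side-apply list type. An illustrative fragment of the expected shape (an assumed sketch, not the exact generated file):

```yaml
# Illustrative CRD schema fragment: a string list marked as a set,
# which satisfies the x-kubernetes-list-type requirement for SSA.
topologyCategories:
  type: array
  x-kubernetes-list-type: set
  items:
    type: string
```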

@jsafrane
Contributor

The verify-crd-schema job seems to fail due to topologyCategories which is not in scope of this change and probably should be overriden:

No. The test works in other PRs without any issues.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 3, 2024
@RomanBednar
Contributor Author

@deads2k Given the simplicity and expected cadence of similar PRs, we want to release this without TechPreview. With the new process, which is based on feature gates, can this be achieved by simply including the feature gate in the Default cluster profile? https://github.com/openshift/api/pull/1783/files#diff-503cbb11f85749eb9ced1d2fcebdd1474b8df9ef32be2d845673838de8b7ff0cR596

@jsafrane
Contributor

jsafrane commented Apr 4, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 4, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2024
// +optional
TopologyCategories []string `json:"topologyCategories,omitempty"`

// globalMaxSnapshotsPerBlockVolume is a global configuration parameter that applies to volumes on all kinds of
// datastores. If unset it defaults to 3.
Contributor


Can we do something like

If omitted, the platform chooses a default, which is subject to change over time, currently that default is 3.

This matches other fields, rather than promising a specific value.

Contributor Author


Sounds good, changed.

// datastores. If unset it defaults to 3.
// Increasing number of snapshots above 3 can have negative impact on performance, for more details see: https://kb.vmware.com/s/article/1025279
// Volume snapshot documentation: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html
// +kubebuilder:validation:Minimum=1
Contributor


how do I choose: "no snapshots"?

Contributor Author


Already discussed here: #1783 (comment)

VMware does not document this value, but testing showed that it does not disable snapshots; the default is applied instead. I can add a clarifying statement:

Setting this value to 0 does not disable volume snapshots, but results in default value being used.

Contributor


Already discussed here: #1783 (comment)

Please add a sentence in the documentation indicating that snapshots cannot be disabled - on this item, and on the others if they cannot be disabled either.

Contributor Author


Added for all 3 options, as it's applicable to all of them.

reportProblemsToJiraComponent("Storage / Kubernetes External Components").
contactPerson("rbednar").
productScope(ocpSpecific).
enableIn(Default, TechPreviewNoUpgrade).
Contributor


TechPreview first please. Demonstrate stability (either via automation or QE sign off) and we can promote in 4.16.

Contributor Author


I know TechPreview first is the safest and most common option; however, this change is relatively trivial and only allows setting existing driver options that are not themselves new. We're expecting more customer requests similar to this one and would like to see if it's possible to ship this in less than two release cycles.

Would it be possible to have this enabled by default in 4.16 (without TechPreview) if we have a sign off from QE?

Contributor


Before becoming accessible-by-default, we must have evidence of completeness and reliability. The preferred mechanism is to run automated CI tests in our existing TechPreview jobs and see 95+% pass rates over at least 14 runs. The alternative mechanism is to have QE sign off on a promotion PR. That can all happen in a single release. For instance, in 4.16:

  1. merge as techpreview
  2. add functionality and tests
  3. after there are 14 runs, open a PR promoting to GA
  4. CI automatically checks for the associated tests (see example failure here)
  5. the PR is merged on green OR QE signs off on the PR and we override the automated stability check (we track data on this in 4.16)
  6. we ship 4.16

So this doesn't delay availability in 4.16, assuming the feature is tested and is as stable as expected.

Contributor Author


@deads2k Removed Default cluster profile for this feature, can we merge this now as TechPreview?

@@ -279,8 +279,35 @@ type VSphereCSIDriverConfigSpec struct {
// If cluster Infrastructure object has a topology, values specified in
// Infrastructure object will be used and modifications to topologyCategories
// will be rejected.
// +listType=set
Contributor


this should have failed our CRD schema checker. Why would this be a safe change to make for clients?

Contributor Author


Previously the checker was failing without listType. I've retried it now, and with the recent changes verification passes without it - dropping this change.

@RomanBednar
Contributor Author

@deads2k PTAL again

@@ -32146,6 +32146,21 @@
"description": "VSphereCSIDriverConfigSpec defines properties that can be configured for vsphere CSI driver.",
"type": "object",
"properties": {
"globalMaxSnapshotsPerBlockVolume": {
"description": "globalMaxSnapshotsPerBlockVolume is a global configuration parameter that applies to volumes on all kinds of datastores. If omitted, the platform chooses a default, which is subject to change over time, currently that default is 3. Setting this value to 0 does not disable volume snapshots, but results in default value being used. Increasing number of snapshots above 3 can have negative impact on performance, for more details see: https://kb.vmware.com/s/article/1025279 Volume snapshot documentation: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html",

@RomanBednar If we mention that the value could be set to 0 (TBH, I think we could safely reject the 0 value), it seems we also need to change the minimum value setting in https://github.com/openshift/api/pull/1783/files#diff-5cb495d65e27aec64145d4043a9e32d029bab426765d92b147556d459be5735bR219, right?

Contributor Author


@Phaow The comment about the 0 value is meant to be informative rather than required; it just explains why 0 does not make sense. I can write it better or remove it. The current validation (minimum: 1) rejects 0, which is correct - or am I missing something?


@RomanBednar Yeah, the current validation (minimum: 1) rejecting 0 looks good to me. It's just a bit odd that we reject 0 while the description explains what happens if 0 is set (the default is used) - that may be confusing.

Contributor Author


@Phaow ok, removing

@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 15, 2024
Contributor

openshift-ci bot commented Apr 15, 2024

@RomanBednar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-gcp
Commit: d7b9ecd (link)
Required: false
Rerun command: /test e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k
Contributor

deads2k commented Apr 15, 2024

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2024
Contributor

openshift-ci bot commented Apr 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jsafrane, RomanBednar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit c0feb35 into openshift:master Apr 15, 2024
17 of 18 checks passed