cephcluster: tune devices according to the deviceType #944

Merged
merged 2 commits into red-hat-storage:master from tuning_device on Dec 8, 2020

Conversation

@crombus crombus commented Dec 7, 2020

Signed-off-by: crombus <pkundra@redhat.com>

@crombus crombus added this to the ocs 4.7 feature freeze milestone Dec 7, 2020
@crombus crombus added and then removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Dec 7, 2020
@obnoxxx obnoxxx left a comment

Can you please add an explanation to the commit message and the PR description of what problem this PR is trying to solve?

@obnoxxx obnoxxx left a comment

If I understand it correctly, the deviceType can be set in the storageDeviceSet in the storage cluster yaml input. If it is set, it can override the otherwise potentially wrongly detected device type, i.e. it has to be set explicitly by the admin or whoever is crafting the input (GUI, ...). This PR takes this specified device type and propagates the corresponding OSD tuning settings to rook, in addition to (and overriding) the settings of PR #864, which enabled tuning settings based on known storage class types for backend disks.

To be honest, the code is getting more and more confusing. I suggest rebasing these changes on top of the refactoring of the tuning code in #946, which will make the resulting code more obvious and this change even shorter.

I further suggest adding this new code to the throttleStorageDevices() function (called checkTuneStorageDevices after the refactor), instead of directly to the ensureCephCluster() function.
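
For illustration only (this is not the actual ocs-operator code): a minimal Go sketch of the priority rule described above, using hypothetical names (diskSpeed, speedFromDeviceType, resolveTuning). An explicitly configured deviceType wins over whatever was detected from the backing storage class (the #864 behaviour), and an empty deviceType falls back to the detected value.

    package main

    import (
        "fmt"
        "strings"
    )

    // diskSpeed stands in for the tuning class that would be propagated to rook.
    type diskSpeed string

    const (
        diskSpeedUnknown diskSpeed = "unknown"
        diskSpeedSlow    diskSpeed = "slow" // e.g. HDD: throttled OSD tuning
        diskSpeedFast    diskSpeed = "fast" // e.g. SSD/NVMe: fast-device tuning
    )

    // speedFromDeviceType maps an admin-supplied deviceType from the
    // storageDeviceSet to a tuning class; an empty or unrecognized value
    // means "not set".
    func speedFromDeviceType(deviceType string) diskSpeed {
        switch strings.ToLower(deviceType) {
        case "hdd":
            return diskSpeedSlow
        case "ssd", "nvme":
            return diskSpeedFast
        default:
            return diskSpeedUnknown
        }
    }

    // resolveTuning gives the explicit deviceType priority over the value
    // detected from the backing StorageClass.
    func resolveTuning(deviceType string, detected diskSpeed) diskSpeed {
        if s := speedFromDeviceType(deviceType); s != diskSpeedUnknown {
            return s
        }
        return detected
    }

    func main() {
        // deviceType "hdd" set by the admin, although the StorageClass looked fast.
        fmt.Println(resolveTuning("hdd", diskSpeedFast)) // slow
        // No deviceType set: fall back to the detected value.
        fmt.Println(resolveTuning("", diskSpeedFast)) // fast
    }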

@crombus crombus commented Dec 8, 2020

> I suggest rebasing these changes on top of the refactoring of the tuning code in #946 ... I further suggest adding this new code to the throttleStorageDevices() function (called checkTuneStorageDevices after the refactor), instead of directly to the ensureCephCluster() function.

I've made the changes on top of the refactored code. I'd appreciate suggestions to make it a little better.

@obnoxxx obnoxxx left a comment

Thanks for rebasing!

See the inline requests and suggestions to make this even better and more minimal.

Thanks!

(5 resolved inline review threads on pkg/controller/storagecluster/cephcluster.go, now outdated)
@obnoxxx obnoxxx left a comment

Oh, and could you update the commit message a bit to explain that this is not about auto-detected HDD types, but about HDD types that are explicitly configured by the admin (or GUI?) in the deviceset section of the storagecluster yaml.

@crombus crombus force-pushed the tuning_device branch 2 times, most recently from b2b2ad1 to f7b9fa4 on December 8, 2020 at 14:05
@crombus crombus requested a review from obnoxxx December 8, 2020 14:05
(2 resolved inline review threads on pkg/controller/storagecluster/cephcluster.go, now outdated)
@crombus crombus commented Dec 8, 2020

/test ci/prow/ocs-operator-ci

@openshift-ci-robot

@crombus: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test ci-index
  • /test images
  • /test ocs-operator-bundle-e2e-aws
  • /test ocs-operator-ci
  • /test red-hat-storage-ocs-ci-e2e-aws
  • /test verify-latest-csv

Use /test all to run all jobs.

In response to this:

/test ci/prow/ocs-operator-ci

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tune devices according to the deviceType added by the admin in storagecluster.yaml and modify the testcase

Signed-off-by: crombus <pkundra@redhat.com>

check that deviceType is considered as priority

Signed-off-by: crombus <pkundra@redhat.com>
@obnoxxx obnoxxx left a comment

/lgtm

Thanks for the updates!

@openshift-ci-robot openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 8, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, obnoxxx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Dec 8, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@obnoxxx obnoxxx commented Dec 8, 2020

Now the bot has retriggered the CI runs, but before that, ocs-operator-ci was failing in a strange way (already for the second time on this PR):

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ocs-operator/944/pull-ci-openshift-ocs-operator-master-ocs-operator-ci/1336353842756849664

First, it is failing in TestEnsureExternalStorageClusterResources like this:

     external_resources_test.go:76: 
        	Error Trace:	external_resources_test.go:76
        	Error:      	Received unexpected error:
        	            	dial tcp 127.0.0.1:13095: connect: connection refused
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:194: 
        	Error Trace:	external_resources_test.go:194
        	            				external_resources_test.go:78
        	Error:      	Received unexpected error:
        	            	storageclasses.storage.k8s.io "ocsinit-ceph-rbd" not found
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:198: 
        	Error Trace:	external_resources_test.go:198
        	            				external_resources_test.go:78
        	Error:      	Not equal: 
        	            	expected: "device_health_metrics"
        	            	actual  : ""
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-device_health_metrics
        	            	+
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:194: 
        	Error Trace:	external_resources_test.go:194
        	            				external_resources_test.go:78
        	Error:      	Received unexpected error:
        	            	storageclasses.storage.k8s.io "ocsinit-cephfs" not found
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:198: 
        	Error Trace:	external_resources_test.go:198
        	            				external_resources_test.go:78
        	Error:      	Not equal: 
        	            	expected: "myfs"
        	            	actual  : ""
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-myfs
        	            	+
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:198: 
        	Error Trace:	external_resources_test.go:198
        	            				external_resources_test.go:78
        	Error:      	Not equal: 
        	            	expected: "myfs-data0"
        	            	actual  : ""
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-myfs-data0
        	            	+
        	Test:       	TestEnsureExternalStorageClusterResources
    external_resources_test.go:194: 
        	Error Trace:	external_resources_test.go:194
        	            				external_resources_test.go:78
        	Error:      	Received unexpected error:
        	            	storageclasses.storage.k8s.io "ocsinit-ceph-rgw" not found
        	Test:       	TestEnsureExternalStorageClusterResources 

but before that, there is a strange PASS:

 === RUN   TestCollectObjectStoreHealth
E1208 16:59:46.159345   11090 ceph-object-store.go:123] CephObjectStore in unexpected phase. Must be "Connected", "Progressing" or "Failure"
--- PASS: TestCollectObjectStoreHealth (0.00s)

Is this expected? 🤔

@obnoxxx obnoxxx commented Dec 8, 2020

Now ocs-operator-ci has passed - strange flake...

@openshift-merge-robot

@crombus: The following test failed, say /retest to rerun all failed tests:

Test name                                Commit    Details    Rerun command
ci/prow/red-hat-storage-ocs-ci-e2e-aws  ada7e14   link       /test red-hat-storage-ocs-ci-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit a602d12 into red-hat-storage:master Dec 8, 2020
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.

6 participants