Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cephcluster: tune disks according to plaftorm #955

Merged
merged 2 commits into from Jan 15, 2021

Conversation

crombus
Copy link
Contributor

@crombus crombus commented Dec 10, 2020

set tunefastdeviceclass according to plaftorm
which are not correctly detected by ceph.
modify tc according to the changes.

Signed-off-by: crombus pkundra@redhat.com

Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed before, looks good in general, but some requests inside.

pkg/controller/storagecluster/cephcluster.go Outdated Show resolved Hide resolved
@@ -55,3 +61,18 @@ func avoidObjectStore(p configv1.PlatformType) bool {
}
return false
}

func (p *Platform) TunePlatformForFastDevices(c client.Client) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do

func (r *ReconcileStorageCluster) TunePlatformForFastDevices() (bool, error) {
  platform := r.platform
  client := r.client
  ...
}

Makes it easier on the caller, Imho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that we keep the old way.

It seems like we made it easier on the caller but we actually made it difficult.
While with the old implementation caller only required a *Platform and a client.Client, with the suggested change caller would now need to create a concrete *ReconcileStorageCluster even though it isn't necessary.

It is easy to ignore this as we seem to have a *ReconcileStorageCluster always available but we shouldn't rely on that.
Also, since this tuning is a behavior that depends on platform rather than the reconciler, it should really be a method on Platform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have an internal method on platform.
But I request the convenience wrapper on the reconciler, since it has all we need, and it avoids code duplication (like with avoidObjectStore) when there might be multiple callers.
This pattern of GetPlatform and then invoke the other method is cumbersome and error-prone imho.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the end, I won't insist too much if folks are more inline with the original approach. I think both approaches have pros and cons. ;-)

pkg/controller/storagecluster/platform_detection.go Outdated Show resolved Hide resolved
@openshift-merge-robot
Copy link
Contributor

@crombus: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/red-hat-storage-ocs-ci-e2e-aws 99c3a78 link /test red-hat-storage-ocs-ci-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes look good now, thanks!

There is one change in the test code that seems to not really belong into this commit. (see inline) - it could be a separate patch to test the effects of PR #944.

And shouldn't we have a test case for this new behaviour?

@@ -304,11 +304,12 @@ func TestThrottleStorageDevices(t *testing.T) {
StorageClassName: &fakestorageClassName,
},
},
Portable: true,
Portable: true,
DeviceType: "hdd",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a change to the test case that is not related to this code change. 🤔

It explicitly configures the type "hdd" in the device set and correctly changes the expected result to Slow. But isn't that the test for the previous PR #944 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to solve this error Error Trace: storagecluster_controller_test.go:426 Error: Received unexpected error: could not get infrastructure details to determine cloud platform: infrastructures.config.openshift.io "cluster" not found

Copy link
Contributor Author

@crombus crombus Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yes for this we will have to modify the creation of cluster with platform adding new test cases for this addidtion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crombus - hmmm - wouldn't it be better / more natural to fix this problem by changing the cluster structure in the test so that the platform detection can not fail? It does not seem to be conceptually correct to fix it by avoiding the call in the first place by adding a configuration we know will be checked first...

Can the platform detection fail in the real world? (not mocked-up test code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it can fail. And cration of cluster with platform is the correct fix

@@ -550,6 +550,14 @@ func (r *ReconcileStorageCluster) checkTuneStorageDevices(ds ocsv1.StorageDevice
return dt.speed, nil
}

ret, err := r.TunePlatformForFastDevices()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a more simpler implementation would be to check if TuneFastPlatforms contains r.platform.GetPlatform().
That way I wouldn't have to jump to definition of TunePlatformForFastDevices to understand what is happening and why we are tuning.

Currently it seems like TunePlatformForFastDevices is doing tuning intensive stuff while in reality it's just checking if TuneFastPlatforms contains r.platform.GetPlatform().

As a follow-up I'd look into re-factoring platform detection and maybe storing it in ReconcileStorageCluster if it makes more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me I feel both ways are good and maybe if adding comment solve this problem that is also a way to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If your code needs a comment to explain what it does, it isn't really a readable/understandable code".
Read this somewhere and I relate.

We could inline the call, rename the method, use better variables etc to make it better. Adding a comment would be the last thing I'd suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a more simpler implementation would be to check if TuneFastPlatforms contains r.platform.GetPlatform().

Well, that is exactly what the function does!?

That way I wouldn't have to jump to definition of TunePlatformForFastDevices to understand what is happening and why we are tuning.

That's more a naming thing than an implementation thing. I explicitly asked @crombus to write such a function to not have the implementation bleed into the caller. I didn't want to repeat the (imho) bad approach of aviodObjectStore() where each caller first has to go and call GetPlatform and then insert it into avoidObjectStore. This is bad encapsulation/abstraction, and error prone. If there is more than one caller, it's also code duplication.

So it boils down to: we only need to name the function / vars better to make it more obvious what is going on.

Currently it seems like TunePlatformForFastDevices is doing tuning intensive stuff while in reality it's just checking if TuneFastPlatforms contains r.platform.GetPlatform().

Right, there needs to be a better name. I like it when in some languages, like ruby, functions can end with a '?' symbol. That makes it more obvious. But here we don't have that option. We could do something like this:

tuneFastDevices, err := r.DevicesDefaultToFastForThisPlatform()
...

or

tuneDevicesFast, err := DoTuneDevicesFastForPlatform()
...

or similar...

As a follow-up I'd look into re-factoring platform detection and maybe storing it in ReconcileStorageCluster if it makes more sense.

Possible. I have started a discussion with @crombus already about possible refactorings of the platform-detection code triggered by this change. Storing data sounds like an optimization. But let's get the APIs good in the first place. And let's not add new stuff that needs immediate refactoring afterwards. And to reiterate my initial point: I think the current approach is the right encapsulation and just needs better naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crombus - can you rename the function and variable along the above lines to make the code more obvious?

@@ -304,11 +304,12 @@ func TestThrottleStorageDevices(t *testing.T) {
StorageClassName: &fakestorageClassName,
},
},
Portable: true,
Portable: true,
DeviceType: "hdd",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see cases of different possible values of DeviceType covered. Also, a case for invalid DeviceType.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but this is not the topic of this PR. That was introduced in a different PR. So we can (and should) enhance that in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not the topic of this PR, why does it add DeviceType: "hdd" to test case?
Honestly, I've lost track of "Tune disks" PRs and enhancements 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not the topic of this PR, why does it add DeviceType: "hdd" to test case?

This is exactly what I told @crombus: Using this to fix the test failure is wrong. It is just avoiding the code path.
Instead we need to make the fake cluster so that it has platform info, so the new code does not error out.

Honestly, I've lost track of "Tune disks" PRs and enhancements sweat_smile

😉

@@ -550,6 +550,14 @@ func (r *ReconcileStorageCluster) checkTuneStorageDevices(ds ocsv1.StorageDevice
return dt.speed, nil
}

ret, err := r.TunePlatformForFastDevices()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crombus - can you rename the function and variable along the above lines to make the code more obvious?

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 24, 2020
@jarrpa
Copy link
Member

jarrpa commented Jan 13, 2021

@crombus Hey, can we revisit this?

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2021
@crombus crombus force-pushed the tune_platform branch 2 times, most recently from 45dbe6f to 425844e Compare January 15, 2021 10:19
@crombus
Copy link
Contributor Author

crombus commented Jan 15, 2021

All requests have been addressed except from changing the test-case accordingly .

I get this error
gp2". multiple group-version-kinds associated with type *v1.StorageClass, refusing to guess at one

for changes in tc

       for _, tc := range testcases {
               reconciler := createFakeInitializationStorageClusterReconcilerWithPlatform(t, tc.platform, tc.storageCluster, tc.storageClass)
               for _, ds := range tc.deviceSets {
                       actualSpeed, err := reconciler.checkTuneStorageDevices(ds)
                       assert.NoError(t, err)
                       assert.Equalf(t, tc.expectedSpeed, actualSpeed, "[%q]: failed to get expected output", tc.label)
               }
       }

crombus and others added 2 commits January 15, 2021 12:27
set tunefastdeviceclass according to plaftorm
which are not correctly detected by ceph.
modify tc according to the changes by adding
a default platform (of None) to the reconciler.

Signed-off-by: crombus <pkundra@redhat.com>
This is tested with the example of the Azure platform.

Note: This can probably be done more elegantly, as currently, the test
adds the configured platform to the reconciler after its creation and
not as part of the creator function. But it should be good enough for
a start.

Signed-off-by: Michael Adam <obnox@redhat.com>
@obnoxxx
Copy link
Contributor

obnoxxx commented Jan 15, 2021

Updated the tc fix (by adding a default platform entry to the reconciler) and added a specific tc for the new code path after discussing with @crombus .

let's see what the test runs bring

Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: obnoxxx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 15, 2021

@crombus: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/red-hat-storage-ocs-ci-e2e-aws 4091a66 link /test red-hat-storage-ocs-ci-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit c14dbba into red-hat-storage:master Jan 15, 2021
@crombus
Copy link
Contributor Author

crombus commented Jan 15, 2021

/cherry-pick release-4.7

@openshift-cherrypick-robot

@crombus: new pull request created: #994

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants