Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-9996: make lvms configuration more robust #1489

Merged

Conversation

dhellmann
Copy link
Contributor

@dhellmann dhellmann commented Mar 11, 2023

Instead of requiring the user to configure their system with a "rhel"
volume group, look at the volume groups that exist and pick one as the default. If there is only 1, use it. If there is a "microshift" VG, use it. Otherwise, disable the CSI driver.

/assign @copejon

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 11, 2023
@openshift-ci-robot
Copy link

@dhellmann: This pull request references Jira Issue OCPBUGS-9996, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instead of requiring the user to configure their system with a "rhel"
volume group, look at the volume groups that exist and pick one as the
default. If there is a "rhel" VG, use it. Otherwise, select the
largest volume group. If there are no volume groups at all, disable
the CSI driver.

/assign @copejon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 11, 2023
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2023
@dhellmann
Copy link
Contributor Author

/cherry-pick release-4.13

@openshift-cherrypick-robot

@dhellmann: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Copy link
Contributor Author

/test e2e-reboot
/test e2e-openshift-conformance-reduced

@copejon
Copy link
Contributor

copejon commented Mar 14, 2023

I'm pretty hesitant to make educated guesses about users' systems. For one, it means user's can't expect uniform default behavior from the platform and will get different results based on their storage stack. It'd be safer instead to select the VG that backs the user's homedir. This would mesh well with current dev env we suggest with a microshift user, be predictable, and allow users to select special VGs if they want. Making the 'rhel' name special also makes the storage configuration just a little bit more esoteric - one more reason to assume home VG if none are provided.

@ggiguash
Copy link
Contributor

ggiguash commented Mar 14, 2023

Maybe, we should not attempt selecting any volumes if 'rhel' does not exist and nothing is specified in the config?
Disabling CSI in case of a non-existent config is great.

@dhellmann
Copy link
Contributor Author

I'm pretty hesitant to make educated guesses about users' systems. For one, it means user's can't expect uniform default behavior from the platform and will get different results based on their storage stack. It'd be safer instead to select the VG that backs the user's homedir. This would mesh well with current dev env we suggest with a microshift user, be predictable, and allow users to select special VGs if they want. Making the 'rhel' name special also makes the storage configuration just a little bit more esoteric - one more reason to assume home VG if none are provided.

The point about predictability is fair. There is no user with a home directory, so I'm not sure how to use that to be more deterministic. Just look for /home? Maybe the volume group where / is would be a better rule?

How many hosts where someone is likely to run MicroShift actually have multiple volume groups?

Maybe, we should not attempt selecting any volumes if 'rhel' does not exist and nothing is specified in the config?
Disabling CSI in case of a non-existent config is great.

Disabling if there is no 'rhel' VG and the user has not provided a config file would be more predictable. It's not a minimal configuration for a nice out-of-the-box experience, though, so I was hoping to make things "just work."

@ggiguash
Copy link
Contributor

ggiguash commented Mar 14, 2023

My vote would be to make it simple and robust rather than "clever". There's no end to users' imagination, so we may inadvertedly end up in situations where the volume selection logic does not behave well.

@dhellmann
Copy link
Contributor Author

My vote would be to make it simple and robust rather than "clever". There's no end to users' imagination, so we may inadvertedly end up in situations where the volume selection logic does not behave well.

It's not selecting a volume. It's selecting a volume group. The vast majority of systems running MicroShift are only going to have 1 of those.

@dhellmann dhellmann force-pushed the OCPBUGS-9996-lvms-default-vg branch from d3d9d1f to 4ed902f Compare March 14, 2023 19:06
@dhellmann
Copy link
Contributor Author

@ggiguash @copejon , I've changed the logic to prefer "rhel" but choose the single existing VG if it is named otherwise. If there are multiples, then it will require the user to configure the driver explicitly.

@copejon
Copy link
Contributor

copejon commented Mar 14, 2023

@ggiguash @copejon , I've changed the logic to prefer "rhel" but choose the single existing VG if it is named otherwise. If there are multiples, then it will require the user to configure the driver explicitly.

+1 I think that's a pretty reasonable fallback.

@ggiguash
Copy link
Contributor

@ggiguash @copejon , I've changed the logic to prefer "rhel" but choose the single existing VG if it is named otherwise. If there are multiples, then it will require the user to configure the driver explicitly.

I totally agree that's the best approach. Thank you, @dhellmann !

@ggiguash
Copy link
Contributor

/retest

@ggiguash
Copy link
Contributor

@dhellmann, do we need to update GitHub docs with this change? I think we need to at least mention this new logic in the config doc, maybe a CSI design as well.

@dhellmann
Copy link
Contributor Author

@dhellmann, do we need to update GitHub docs with this change? I think we need to at least mention this new logic in the config doc, maybe a CSI design as well.

Good point. See cb4e618. While I was working on that, I made a few other changes in that file in other commits.

@dhellmann
Copy link
Contributor Author

/test e2e-openshift-conformance-reduced

@dhellmann
Copy link
Contributor Author

CI system timed out accessing images

/test e2e-openshift-conformance-reduced

Comment on lines 33 to 38
// DefaultLvmdConfig returns a configuration struct for Lvmd with
// default settings based on the current host. If the default volume
// group name "rhel" is found, that volume group is used, otherwise
// the largest volume group is used. If there are no volume groups,
// the configuration returned will report that it is not enabled (see
// IsEnabled()).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit: the comment needs to be fixed. The code no longer picks the largest volume group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this and added some unit tests in the latest draft.

@dhellmann dhellmann force-pushed the OCPBUGS-9996-lvms-default-vg branch from d978782 to c70ee9c Compare March 16, 2023 19:21
@dhellmann
Copy link
Contributor Author

/test e2e-greenboot

@dhellmann
Copy link
Contributor Author

/retest

1 similar comment
@dhellmann
Copy link
Contributor Author

/retest

Instead of requiring the user to configure their system with a "rhel"
volume group, look at the volume groups that exist and apply some
selection rules:

* If there is a "rhel" VG, use it.
* If there is only one VG, use it.
* Otherwise, disable the CSI driver with a log message.
When there are multiple volume groups, look for one named "microshift"
to use, instead of "rhel".
@ggiguash
Copy link
Contributor

ggiguash commented Mar 19, 2023

@dhellmann , the kutiltest is failing with the following errors in the CI:

   case.go:364: failed in step 30-
    case.go:366: deployments.apps "topolvm-controller" not found
    case.go:366: daemonsets.apps "topolvm-node" not found

Looks like the CI is falling back to the case with CSI disabled.

Mar 18 19:49:28 release-ci-ci-op-tdxlnnbc-7f12f microshift[212336]: infrastructure-services-manager I0318 19:49:28.259617  212336 lvmd.go:70] Multiple volume groups available but no configuration file is present, disabling CSI. [rhel ]
Mar 18 19:49:28 release-ci-ci-op-tdxlnnbc-7f12f microshift[212336]: infrastructure-services-manager I0318 19:49:28.259682  212336 storage.go:81] CSI is disabled. %sMultiple volume groups are available, but no configuration file was provided.

@dhellmann dhellmann force-pushed the OCPBUGS-9996-lvms-default-vg branch from 764fb4c to 01502d3 Compare March 19, 2023 14:20
@ggiguash
Copy link
Contributor

@dhellmann, last error remaining:

# github.com/openshift/microshift/pkg/config/lvmd [github.com/openshift/microshift/pkg/config/lvmd.test]
pkg/config/lvmd/lvmd_test.go:22:5: unknown field 'DisabledReason' in struct literal of type Lvmd
pkg/config/lvmd/lvmd_test.go:90:5: unknown field 'DisabledReason' in struct literal of type Lvmd

Include a message in the body of the file added to the config map to
explain where the content came from.
@dhellmann dhellmann force-pushed the OCPBUGS-9996-lvms-default-vg branch from c5e813a to 5d313f8 Compare March 19, 2023 17:37
@dhellmann
Copy link
Contributor Author

@dhellmann, last error remaining:

# github.com/openshift/microshift/pkg/config/lvmd [github.com/openshift/microshift/pkg/config/lvmd.test]
pkg/config/lvmd/lvmd_test.go:22:5: unknown field 'DisabledReason' in struct literal of type Lvmd
pkg/config/lvmd/lvmd_test.go:90:5: unknown field 'DisabledReason' in struct literal of type Lvmd

That's what I get for not running the unit tests locally before pushing. :-/

@openshift-ci-robot
Copy link

@dhellmann: This pull request references Jira Issue OCPBUGS-9996, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

In response to this:

Instead of requiring the user to configure their system with a "rhel"
volume group, look at the volume groups that exist and pick one as the default. If there is only 1, use it. If there is a "microshift" VG, use it. Otherwise, disable the CSI driver.

/assign @copejon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ggiguash
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, ggiguash

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 19, 2023

@dhellmann: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit cd656cd into openshift:main Mar 19, 2023
14 checks passed
@openshift-ci-robot
Copy link

@dhellmann: Jira Issue OCPBUGS-9996: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9996 has been moved to the MODIFIED state.

In response to this:

Instead of requiring the user to configure their system with a "rhel"
volume group, look at the volume groups that exist and pick one as the default. If there is only 1, use it. If there is a "microshift" VG, use it. Otherwise, disable the CSI driver.

/assign @copejon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@dhellmann: new pull request created: #1530

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants