Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MGMT-16721: handle hosts with no luks #5940

Merged
merged 1 commit into from Feb 7, 2024

Conversation

paul-maidment
Copy link
Contributor

@paul-maidment paul-maidment commented Feb 1, 2024

In line 41 of internal/host/hostcommands/tang_connectivity_check_cmd.go
Both luks and luks.Clevis are pointers which can be nil, there was no checking in place to ensure that we avoid a nil pointer dereference

This PR fixes that by making appropriate checks to ensure that neither of these pointers are nil.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 1, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 1, 2024

@paul-maidment: This pull request references MGMT-16721 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

In line 41 of internal/host/hostcommands/tang_connectivity_check_cmd.go Both luks and luks.Clevis are pointers which can be nil, there was no checking in place to ensure that we avoid a nil pointer dereference

This PR fixes that by makign appropriate checks to ensure that neither of these pointers are nil.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 1, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2024
@gamli75
Copy link
Contributor

gamli75 commented Feb 1, 2024

@rccrdpccl please review

@paul-maidment
Copy link
Contributor Author

/retest

@omertuc
Copy link
Contributor

omertuc commented Feb 1, 2024

What does it mean for Luks to be present in an ignition file but for Clevis to be nil? Is that a valid situation? What is the expected behavior from the service in that situation? To ignore it? To report an error?

Please elaborate

@rccrdpccl
Copy link
Contributor

What does it mean for Luks to be present in an ignition file but for Clevis to be nil? Is that a valid situation? What is the expected behavior from the service in that situation? To ignore it? To report an error?

Please elaborate

Could you please also change description of the issue and commit according to the answer for the question above :)?

@rccrdpccl
Copy link
Contributor

/retest

Copy link

codecov bot commented Feb 1, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (df15ca0) 68.26% compared to head (9a82bd4) 68.32%.
Report is 12 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5940      +/-   ##
==========================================
+ Coverage   68.26%   68.32%   +0.05%     
==========================================
  Files         236      236              
  Lines       34913    34922       +9     
==========================================
+ Hits        23833    23859      +26     
+ Misses       9006     8995      -11     
+ Partials     2074     2068       -6     
Files Coverage Δ
internal/host/validator.go 81.77% <ø> (ø)
...l/host/hostcommands/tang_connectivity_check_cmd.go 76.19% <40.00%> (+4.03%) ⬆️

... and 6 files with indirect coverage changes

@paul-maidment
Copy link
Contributor Author

What does it mean for Luks to be present in an ignition file but for Clevis to be nil? Is that a valid situation? What is the expected behavior from the service in that situation? To ignore it? To report an error?

Please elaborate

I just observed that both luks and clevis are pointers and can have a value of nil.
Simply based on the fact that pointers can be nil, (whether or not this is valid) is reasonable grounds to check for nil I believe.

@omertuc
Copy link
Contributor

omertuc commented Feb 4, 2024

What does it mean for Luks to be present in an ignition file but for Clevis to be nil? Is that a valid situation? What is the expected behavior from the service in that situation? To ignore it? To report an error?
Please elaborate

I just observed that both luks and clevis are pointers and can have a value of nil. Simply based on the fact that pointers can be nil, (whether or not this is valid) is reasonable grounds to check for nil I believe.

I'm in favor of checking, but we perform a check so we can branch out our behavior.

In case of A, do X. In case of B, do Y.

In this PR it's not clear to me how you decided on X and Y, as I don't understand the circumstances that lead to A and B and the reasoning behind what is the correct behavior for the service in each of these circumstances

Or in other words, why return nil, nil ? (not to say that it's the wrong thing to do, just that it's not clear to me, it doesn't seem obvious, and so it should be thought out and pointed out in a code comment)

@omertuc
Copy link
Contributor

omertuc commented Feb 4, 2024

If you're not sure, even something like // TODO: not sure how this could happen or whether its possibly valid, for now pretend encryption is disabled just so we avoid a nil pointer dereference is also completely valid, as it's obviously improving the previous situation by avoiding a crash, but at least we can recognize that this needs to be thought out more thoroughly and needs looking into

@paul-maidment
Copy link
Contributor Author

paul-maidment commented Feb 4, 2024

Even something like // TODO: not sure how this could happen or whether its possibly valid, for now pretend encryption is disabled just so we avoid a nil pointer dereference is also completely valid, as it's obviously improving the previous situation by avoiding a crash, but at least we can recognize that this needs to be thought out more thoroughly and needs looking into

I'll go with this suggestion and will also add test cases for luks == nil and luks.clevis == nil, just to ensure we have the best coverage.

Seems to me that just because a cluster is imported does not necessarily imply that there needs to be disk encryption, maybe we are simply covering the case where an imported cluster does not have disk encryption?

Certainly the case of missing luks we can imagine this to be the case

Missing luks.Clevis I do agree that it's odd this should be optional.

So for now, I will go with the comment you suggest.

I'll also add a warning to the log about this scenario.

@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 4, 2024
@paul-maidment paul-maidment force-pushed the MGMT-16721 branch 2 times, most recently from 40efb88 to f8a2ff2 Compare February 4, 2024 14:34
@@ -38,6 +38,11 @@ func (c *tangConnectivityCheckCmd) getTangServersFromHostIgnition(host *models.H
if err != nil {
return nil, err
}
if luks == nil || luks.Clevis == nil {
// TODO: not sure how this could happen or whether its possibly valid, for now pretend encryption is disabled just so we avoid a nil pointer dereference
c.log.Warn("luks configuration is missing or incomplete in the host ignition, disk encryption will be assumed to be disabled.")
Copy link
Contributor

@omertuc omertuc Feb 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

luks being nil is pretty common, it doesn't warrant a warning. We'll get a warning every time a cluster doesn't have encryption (at-least, that's what it I think would happen, not sure, need to look at ignitions from actual clusters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, removed the log entry.

Copy link
Contributor

@omertuc omertuc Feb 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about only luks being nil, which doesn't warrant a warning.

But luks being non-nil and Clevis nil does warrant a warning and a code comment, that's the case I've been discussing in all my comments so far, the one I don't understand and I'm not sure how we should react to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, updated to deal with that scenario.

@paul-maidment paul-maidment force-pushed the MGMT-16721 branch 2 times, most recently from ea7a108 to 6291ed4 Compare February 4, 2024 14:44
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2024

@paul-maidment: This pull request references MGMT-16721 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

In line 41 of internal/host/hostcommands/tang_connectivity_check_cmd.go Both luks and luks.Clevis are pointers which can be nil, there was no checking in place to ensure that we avoid a nil pointer dereference

This PR fixes that by making appropriate checks to ensure that neither of these pointers are nil.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@paul-maidment paul-maidment force-pushed the MGMT-16721 branch 3 times, most recently from 0a6659d to c4b2913 Compare February 4, 2024 16:54
@rccrdpccl
Copy link
Contributor

/unhold
/lgtm
/approve

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2024
@omertuc
Copy link
Contributor

omertuc commented Feb 7, 2024

So in the case of us receiving a nil in this field, I suspect that we should raise an error because this means that we wont be able to boot the LUKS volume as we lack the key.

Does this make sense?

Not really, we're discussing validations for adding hosts to an existing cluster here.

This entire ordeal of looking at the ignition to get tang details is just so we can perform a validation that the tang config specified in the target cluster actually works on this new day-2 host - we know how to recognize and validate tang, so we look into the clevis struct to get tang servers (when those are present), so we can send those details to our agent so it can attempt to connect to the tang servers, followed by the agent reporting about what it experienced doing that, so that our validation can ensure tang works.

If you see the cluster has encryption (i.e. has the luks struct) but it's not using using the clevis struct, and you throw an error because you don't recognize it, then you are implicitly making an assumption of how that encryption works (in particular, you're assuming that it wouldn't work and it deserves and error just because you have no way to validate that it works), but perhaps it just means that our validation should be skipped (both in the tang_connectivity_check_cmd.go and the corresponding validation code that awaits the results from the agent).

The reasoning of skipping instead of giving an error is that the validation was best-effort in the first place - we're doing it as a "favor" to the user. We can validate tang so we do. If the user uses some other scheme we don't recognize, we should still let them add the host to the cluster, and if they run into issues, that's unfortunate, but it's none of our business, because it just might work. We can't assume it won't because we don't know what it is

Also regardless of whether you simply return nil as you did, or return an error - a user's day-2 host installation will still hang - as you never modified the corresponding validation code that is waiting on those tang results from the agent

@rccrdpccl
Copy link
Contributor

/hold

Holding waiting for clarification on this comment:
Also regardless of whether you simply return nil as you did, or return an error - a user's day-2 host installation will still hang - as you never modified the corresponding validation code that is waiting on those tang results from the agent

Is this changing current behaviour, other than fixing the nil pointer case when luks not present?
Just want to make sure we do not add more issues

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2024
@paul-maidment
Copy link
Contributor Author

paul-maidment commented Feb 7, 2024

tang_connectivity_check_cmd

Regarding the validation of disk encryption requirements.

It seems that firstly we fetch host ignition and validate that luks is not nil and contains at least one entry.
https://github.com/openshift/assisted-service/blob/master/internal/host/hostutil/host_utils.go#L326
(I also call the above method from my code, so that's common)

https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L422-L425
Then we check to see if the clevis setting is present, bypassing validation and marking it as successful if these settings are not present

if luks == nil || luks.Clevis == nil {
	// Disk encryption is disabled for workers on day1 cluster
	return ValidationSuccessSuppressOutput, ""
}

the comment // Disk encryption is disabled for workers on day1 cluster is a little confusing as a few lines above, we have already determined that we are looking at a day 2 host.

With this aside, we can see how a Day2 host without Clevis is treated. This is treated as though there is no disk encryption.
Which to me at least, looks like the same logic I implemented in this PR?

@omertuc
Copy link
Contributor

omertuc commented Feb 7, 2024

the comment // Disk encryption is disabled for workers on day1 cluster is a little confusing as a few lines above, we have already determined that we are looking at a day 2 host.

By "day 1 cluster" this comment actually means "target cluster"

@paul-maidment
Copy link
Contributor Author

the comment // Disk encryption is disabled for workers on day1 cluster is a little confusing as a few lines above, we have already determined that we are looking at a day 2 host.

By "day 1 cluster" this comment actually means "target cluster"

Should we change the above comment to make this a lot clearer?

@omertuc
Copy link
Contributor

omertuc commented Feb 7, 2024

With this aside, we can see how a Day2 host without Clevis is treated. This is treated as though there is no disk encryption.
Which to me at least, looks like the same logic I implemented in this PR?

Yes, it's the same.

The comment however is making the false statement that because clavis is nil then there's no encryption. Which is a wrong thing to say as it's too general, there's no tang servers to validate, doesn't mean there's no encryption. But its behavior is still correct since its goal is to validate tang, and clavis being nil does mean there's no tang servers to validate

@omertuc
Copy link
Contributor

omertuc commented Feb 7, 2024

Should we change the above comment to make this a lot clearer?

Sure

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2024
@paul-maidment
Copy link
Contributor Author

paul-maidment commented Feb 7, 2024

/hold

Holding waiting for clarification on this comment: Also regardless of whether you simply return nil as you did, or return an error - a user's day-2 host installation will still hang - as you never modified the corresponding validation code that is waiting on those tang results from the agent

Is this changing current behavior, other than fixing the nil pointer case when luks not present? Just want to make sure we do not add more issues

In short...zero behavioral change. Our existing validations expect these fields to be set otherwise we assume that disk encryption is disabled.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 7, 2024

@paul-maidment: This pull request references MGMT-16721 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

In line 41 of internal/host/hostcommands/tang_connectivity_check_cmd.go
Both luks and luks.Clevis are pointers which can be nil, there was no checking in place to ensure that we avoid a nil pointer dereference

This PR fixes that by making appropriate checks to ensure that neither of these pointers are nil.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

In line 41 of internal/host/hostcommands/tang_connectivity_check_cmd.go
Both `luks` and `luks.Clevis` are pointers which can be nil, there was no checking in place to ensure that we avoid a nil pointer dereference

This PR fixes that by making appropriate checks to ensure that neither of these pointers are nil.
@rccrdpccl
Copy link
Contributor

Thanks for clarifying @paul-maidment
/unhold
/lgtm
/approve

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 7, 2024
Copy link

openshift-ci bot commented Feb 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paul-maidment, rccrdpccl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [paul-maidment,rccrdpccl]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD eb3ed40 and 2 for PR HEAD 9a82bd4 in total

@paul-maidment
Copy link
Contributor Author

/retest

Copy link

openshift-ci bot commented Feb 7, 2024

@paul-maidment: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit bc922ba into openshift:master Feb 7, 2024
14 checks passed
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-agent-installer-api-server-container-v4.16.0-202402072112.p0.gbc922ba.assembly.stream.el8 for distgit ose-agent-installer-api-server.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants