Add node controller to HCCO Manager #1702

Merged

Conversation

enxebre
Member

@enxebre enxebre commented Aug 25, 2022

What this PR does / why we need it:
What:
This introduces a feature to reconcile Nodes with an opinionated Hypershift label pointing to the owning NodePool.

Why:
This makes it possible to list and filter Nodes by NodePool. It also gives every operator a single, consistent, bidirectional NodePool<->Node mapping, so that none of them has to reimplement the same logic, vendor clients, or spend compute resources on it. E.g. NTO TuneD.

How:
Change the NodePool controller to propagate the NodePool annotation down to Machines.
Introduce a Node controller within the HCCO manager that:

  • Watches Nodes.
  • Finds the Machine by using the appropriate CAPI annotations.
  • Finds the NodePool annotation on that Machine.
  • Applies the annotation as a label on the Node.
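
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: the `object` and `machineStore` types and both functions are hypothetical stand-ins for the real Node, Machine and client types, and the two CAPI annotation keys are assumed to be the ones Cluster API sets on Nodes. The annotation value is copied verbatim here.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Assumed CAPI annotation keys found on Nodes (assumption of this sketch).
	machineAnnotation          = "cluster.x-k8s.io/machine"
	clusterNamespaceAnnotation = "cluster.x-k8s.io/cluster-namespace"
	// Key used both as the Machine annotation and as the Node label.
	nodePoolKey = "hypershift.openshift.io/nodePool"
)

// object is a hypothetical stand-in for Node or Machine metadata.
type object struct {
	Name        string
	Annotations map[string]string
	Labels      map[string]string
}

// machineStore is a hypothetical Machine lookup keyed by "namespace/name".
type machineStore map[string]*object

var errNotFound = errors.New("not found")

// nodePoolForNode follows Node -> Machine -> NodePool annotation.
func nodePoolForNode(node *object, machines machineStore) (string, error) {
	name, ok := node.Annotations[machineAnnotation]
	if !ok {
		return "", fmt.Errorf("node %q has no %s annotation", node.Name, machineAnnotation)
	}
	ns, ok := node.Annotations[clusterNamespaceAnnotation]
	if !ok {
		return "", fmt.Errorf("node %q has no %s annotation", node.Name, clusterNamespaceAnnotation)
	}
	machine, ok := machines[ns+"/"+name]
	if !ok {
		return "", fmt.Errorf("machine %s/%s: %w", ns, name, errNotFound)
	}
	nodePool, ok := machine.Annotations[nodePoolKey]
	if !ok {
		return "", fmt.Errorf("machine %s/%s has no %s annotation", ns, name, nodePoolKey)
	}
	return nodePool, nil
}

// reconcileNode applies the NodePool label and reports whether the Node changed.
func reconcileNode(node *object, machines machineStore) (bool, error) {
	nodePool, err := nodePoolForNode(node, machines)
	if err != nil {
		return false, err
	}
	if node.Labels[nodePoolKey] == nodePool {
		return false, nil // already labelled, nothing to do
	}
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[nodePoolKey] = nodePool
	return true, nil
}

func main() {
	machines := machineStore{
		"clusters-example/machine-a": &object{
			Name:        "machine-a",
			Annotations: map[string]string{nodePoolKey: "pool-a"},
		},
	}
	node := &object{
		Name: "node-a",
		Annotations: map[string]string{
			machineAnnotation:          "machine-a",
			clusterNamespaceAnnotation: "clusters-example",
		},
	}
	changed, err := reconcileNode(node, machines)
	fmt.Println(changed, err, node.Labels[nodePoolKey])
}
```

A second call to `reconcileNode` on the same Node is a no-op, which is the property a reconcile loop needs.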

In the future we might extend this to support label propagation to Nodes via the NodePool API.

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
ref https://issues.redhat.com/browse/HOSTEDCP-545

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@enxebre
Member Author

enxebre commented Aug 25, 2022

/hold
to do some additional manual testing.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2022
@enxebre
Member Author

enxebre commented Aug 25, 2022

cc @dagrayvid

@openshift-ci openshift-ci bot requested review from csrwng and sjenning August 25, 2022 12:23
@openshift-ci
Contributor

openshift-ci bot commented Aug 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2022
@enxebre enxebre force-pushed the propagate-nodepool-label branch 4 times, most recently from 9a0ef0e to 37399a9 Compare August 25, 2022 14:58
@dagrayvid
Contributor

@enxebre, I don't think we can label Nodes with the full namespace/name of the NodePool, since '/' characters aren't allowed in label values. For instance:

oc label node/ip-10-0-134-129.ec2.internal "hypershift.openshift.io/nodePool"="namespace/name"
error: invalid label value: "hypershift.openshift.io/nodePool=namespace/name": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')

Unless I'm missing something, this doesn't account for that yet. Will we need 2 separate labels, one for namespace and one for name?
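
The rejection above can be reproduced offline with a small check. This is a sketch that reuses the regex quoted in the error message (anchored at both ends) plus the 63-character limit that Kubernetes places on label values; `isValidLabelValue` is a hypothetical helper name, not an existing API.

```go
package main

import (
	"fmt"
	"regexp"
)

// labelValueRE is the pattern quoted in the error message, anchored so the
// whole value has to match.
var labelValueRE = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)

// maxLabelValueLen is the maximum length Kubernetes allows for a label value.
const maxLabelValueLen = 63

// isValidLabelValue reports whether v would be accepted as a label value
// under the two rules above.
func isValidLabelValue(v string) bool {
	return len(v) <= maxLabelValueLen && labelValueRE.MatchString(v)
}

func main() {
	for _, v := range []string{"namespace/name", "name", "my_value", ""} {
		fmt.Printf("%q valid: %t\n", v, isValidLabelValue(v))
	}
}
```

Only `"namespace/name"` is reported as invalid; the name on its own, a value containing `_`, and the empty string all pass.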

@enxebre
Member Author

enxebre commented Aug 26, 2022

> Unless I'm missing something, this doesn't account for that yet. Will we need 2 separate labels, one for namespace and one for name?

Only the name is needed to filter Nodes; updated.
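
As a rough illustration of the name-only approach, the sketch below strips the namespace from a `namespace/name` value and then filters Nodes by the resulting label. It assumes the Machine annotation holds `namespace/name`; the `node` type and both helper functions are hypothetical and not taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

const nodePoolLabel = "hypershift.openshift.io/nodePool"

// node is a hypothetical stand-in for a Kubernetes Node.
type node struct {
	Name   string
	Labels map[string]string
}

// nodePoolNameFromAnnotation returns the name part of a "namespace/name"
// value. A value without '/' is returned unchanged.
func nodePoolNameFromAnnotation(value string) string {
	if i := strings.LastIndex(value, "/"); i >= 0 {
		return value[i+1:]
	}
	return value
}

// nodesInPool returns the names of the Nodes labelled with the given NodePool name.
func nodesInPool(nodes []node, pool string) []string {
	var names []string
	for _, n := range nodes {
		if n.Labels[nodePoolLabel] == pool {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	pool := nodePoolNameFromAnnotation("clusters/pool-a")
	nodes := []node{
		{Name: "node-1", Labels: map[string]string{nodePoolLabel: pool}},
		{Name: "node-2", Labels: map[string]string{nodePoolLabel: "pool-b"}},
	}
	fmt.Println(pool, nodesInPool(nodes, pool))
}
```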

@enxebre
Member Author

enxebre commented Aug 26, 2022

/hold cancel
/test e2e-aws-nested

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2022
@dagrayvid
Contributor

dagrayvid commented Aug 26, 2022

@enxebre for NTO we have 2 uses for these labels:

  • When choosing the TuneD profile for each Node, only consider TuneD profiles from the Tuned objects referenced in the Node's NodePool spec. (Compare the NodePool label on the Node with the labels on the Tuned objects.)
  • When generating MachineConfigs to pass back to the NodePool controller for setting kernel boot parameters (via ConfigMap), label the ConfigMap with the NodePool name (and namespace?) that the MachineConfig should be applied to.

In this second case, I think we likely need the name and the namespace, to make sure NTO and the NodePool controller are distinguishing between NodePools of the same name from separate namespaces.

Perhaps we could separate name and namespace by '_', as this is supported in label values but not allowed in object names? Or we could have two labels, one for name and one for namespace of the NodePool.
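
For illustration only, the `_` separator idea could look like the sketch below. It encodes the proposal in this comment, not something the PR adopted, and both helper names are hypothetical. It relies on `_` never appearing in a namespace or object name, so the first `_` is an unambiguous separator.

```go
package main

import (
	"fmt"
	"strings"
)

// joinNodePoolRef packs namespace and name into a single label value.
func joinNodePoolRef(namespace, name string) string {
	return namespace + "_" + name
}

// splitNodePoolRef reverses joinNodePoolRef. ok is false when the value
// contains no separator.
func splitNodePoolRef(value string) (namespace, name string, ok bool) {
	return strings.Cut(value, "_")
}

func main() {
	value := joinNodePoolRef("clusters", "pool-a")
	namespace, name, ok := splitNodePoolRef(value)
	fmt.Println(value, namespace, name, ok)
}
```

One caveat with this scheme is that the combined value still has to fit within the 63-character limit on label values, which two separate labels would avoid.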

/cc @csrwng as this is related to the comment thread here and I would probably like to use the same convention if we end up going one of these routes.

@enxebre
Member Author

enxebre commented Aug 26, 2022

> When generating MachineConfigs to pass back to the NodePool controller for setting kernel boot parameters (via ConfigMap), label the ConfigMap with the NodePool name (and namespace?) that the MachineConfig should be applied to.

The namespace is not needed. The relation goes in the other direction: the NodePool controller is aware of the cp namespace where the config for each NodePool lives (in this case, the config auto-generated by the NTO).

@dagrayvid
Contributor

> The namespace is not needed, the relation goes in the other direction, the NodePool controller is aware of the cp namespace where its config live for each NodePool (in this case the config auto generated by the NTO).

Ack, that makes sense. As long as having two NodePools of the same name in different namespaces for the same hosted cluster is not supported, we should be fine with just the name.

@enxebre
Member Author

enxebre commented Aug 29, 2022

/test e2e-aws-nested

Contributor

@csrwng csrwng left a comment

Thanks @enxebre. Some comments.

)

const (
	nodePoolAnnotation = "hypershift.openshift.io/nodePool"
Contributor

s/nodePoolAnnotation/nodePoolLabel/

Contributor

should this be part of the API?

Member Author

It's both an annotation and a label. I updated the PR to move the label declaration into the API.

var apiErr *apierrors.StatusError
nodePoolName, err := r.nodeToNodePoolName(node)
if err != nil {
	if errors.As(err, &apiErr) && !apierrors.IsNotFound(err) {
Contributor

s/IsNotFound(err)/IsNotFound(apiErr)/

Member Author

I think it is right as it is, isn't it? I want to pass the err I got through the apierrors.IsNotFound check.

Contributor

But the check would only work on a StatusError, no? (Otherwise I don't see the point of checking that err contains a StatusError.)

	// annotations are not in place yet, so we'll reconcile triggered by the event which sets them in the Node.
	return ctrl.Result{}, err
} else {
	log.Error(err, "failed to get nodePool name from Node")
Contributor

add node name to error args

Member Author

It should already come with the logger, as it's part of the request.

	error: true,
},
{
	name: "When Machine does not exist it should fail",
Contributor

It seems that this name describes the case above; this case looks like "When node has no annotations".

Member Author

updated.

		},
	},
}
machineWithOutNodePoolAnnotation := &capiv1.Machine{
Contributor

This doesn't seem to be used in any of the tests.

Member Author

updated.

	t.Fatalf("failed to list nodes in guest cluster: %v", err)
}
for _, node := range nodes.Items {
	g.Expect(node.Labels["hypershift.openshift.io/nodePool"]).NotTo(BeEmpty())
Contributor

Use the constant (especially if you move it to the API).

@netlify

netlify bot commented Aug 29, 2022

Deploy Preview for hypershift-docs ready!

Name Link
🔨 Latest commit 6e484d5
🔍 Latest deploy log https://app.netlify.com/sites/hypershift-docs/deploys/630cd53246343f0008fe7a31
😎 Deploy Preview https://deploy-preview-1702--hypershift-docs.netlify.app

@csrwng
Contributor

csrwng commented Aug 29, 2022

/lgtm

Looks like we're hitting AWS limits:

NatGatewayLimitExceeded: Performing this operation would exceed the limit of 80 NAT gateways

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2022
@openshift-ci-robot

/retest-required

Remaining retests: 2 against base HEAD b242dee and 8 for PR HEAD 6e484d5 in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 30, 2022

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/capi-provider-agent-sanity | 6e484d5 | link | false | /test capi-provider-agent-sanity |
| ci/prow/e2e-aws-nested | c7b8dd9 | link | true | /test e2e-aws-nested |
| ci/prow/e2e-kubevirt-gcp-ovn | 6e484d5 | link | false | /test e2e-kubevirt-gcp-ovn |

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
5 participants