OCPBUGS-22399: Disable UWM Telemetry remote writer when MGMT cluster is disconnected #3332

Merged: 1 commit merged into openshift:main on Dec 20, 2023

jparrill
Contributor

@jparrill jparrill commented Dec 18, 2023

This PR disables the UWM telemetry remote writer controller when the management cluster is running in disconnected mode. We assume that this mode is enabled when the first Service network entry is IPv6, because OVN treats that first entry as the primary one.

Follow up To-Do:

  • When the bare metal tests are in place, add a validation that checks this component is disabled in IPv6 disconnected environments

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

Which issue(s) this PR fixes
Fixes #OCPBUGS-22399

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Dec 18, 2023
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-22399, which is invalid:

  • expected the bug to target only the "4.16.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci bot added the area/documentation (Indicates the PR includes changes for documentation) and area/hypershift-operator (Indicates the PR includes changes for the hypershift operator and API - outside an OCP release) labels and removed the do-not-merge/needs-area label Dec 18, 2023

netlify bot commented Dec 18, 2023

Deploy Preview for hypershift-docs ready!

Name Link
🔨 Latest commit bd8c922
🔍 Latest deploy log https://app.netlify.com/sites/hypershift-docs/deploys/658312d2f580f40008c60e73
😎 Deploy Preview https://deploy-preview-3332--hypershift-docs.netlify.app/how-to/disconnected/known-issues

@jparrill
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) label and removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label Dec 18, 2023
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-22399, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @LiangquanLi930

@jparrill
Contributor Author

/retest

@jparrill
Contributor Author

/retest required

Contributor

openshift-ci bot commented Dec 19, 2023

@jparrill: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-aws
  • /test e2e-kubevirt-aws-ovn
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test e2e-aws-4-12
  • /test e2e-aws-4-13
  • /test e2e-aws-metrics
  • /test e2e-conformance
  • /test e2e-ibmcloud-iks
  • /test e2e-ibmcloud-roks

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-hypershift-main-e2e-aws
  • pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn
  • pull-ci-openshift-hypershift-main-images
  • pull-ci-openshift-hypershift-main-unit
  • pull-ci-openshift-hypershift-main-verify

@jparrill
Contributor Author

/retest-required

@jparrill
Contributor Author

/test e2e-aws

@Patryk-Stefanski
Contributor

/lgtm

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label Dec 20, 2023
@jparrill
Contributor Author

/retest-required

@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-22399. The bug has been updated to no longer refer to the pull request using the external bug tracker.

@openshift-ci bot removed the lgtm (Indicates that a PR is ready to be merged.) label Dec 20, 2023
@jparrill reopened this Dec 20, 2023
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-22399, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @LiangquanLi930

The bug has been updated to refer to the pull request using the external bug tracker.

if opts.EnableUWMTelemetryRemoteWrite {
// to remotely write telemetry metrics. The UWM stack will be disabled if the
// cluster is IPv6.
mgmtNetworkConfig := manifests.OpenshiftNetworkConfiguration()
Member

Why are we disabling telemetry for all clusters having ipv6 in the management cluster?

Contributor Author

@jparrill jparrill Dec 20, 2023

Because we are assuming that if the first entry of the serviceNetwork in the MGMT cluster is IPv6, the HostedClusters are disconnected. As I discussed with @sjenning, I don't know of a better way to infer whether the cluster is disconnected. Any suggestions?
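
For readers following the thread, the heuristic described above can be sketched as a standalone helper. This is only an illustration under stated assumptions, not the code from this PR: `firstServiceNetworkIsIPv6` is a hypothetical name, the service network entries are assumed to be CIDR strings such as `fd02::/112`, and the standard library's `net/netip` is used in place of `k8s.io/utils/net`. The sketch also guards against an empty list, which a bare `ServiceNetwork[0]` index would not.

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstServiceNetworkIsIPv6 reports whether the first entry of a service
// network list is an IPv6 CIDR. Hypothetical helper, for illustration only.
// It returns false for an empty list or an entry that does not parse as a CIDR.
func firstServiceNetworkIsIPv6(serviceNetwork []string) bool {
	if len(serviceNetwork) == 0 {
		return false
	}
	prefix, err := netip.ParsePrefix(serviceNetwork[0])
	if err != nil {
		return false
	}
	addr := prefix.Addr()
	// Treat IPv4-mapped IPv6 addresses as IPv4 for the purpose of this check.
	return addr.Is6() && !addr.Is4In6()
}

func main() {
	// Example CIDRs are made up for the demonstration.
	fmt.Println(firstServiceNetworkIsIPv6([]string{"fd02::/112", "172.31.0.0/16"}))
	fmt.Println(firstServiceNetworkIsIPv6([]string{"172.31.0.0/16"}))
	fmt.Println(firstServiceNetworkIsIPv6(nil))
}
```

Note that this only detects the address family of the first entry. As the review below points out, an IPv6 service network does not by itself prove that a cluster is disconnected.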

if err := mgr.GetClient().Get(ctx, crclient.ObjectKeyFromObject(mgmtNetworkConfig), mgmtNetworkConfig); err != nil {
return fmt.Errorf("failed to get network config resource: %w", err)
}
if opts.EnableUWMTelemetryRemoteWrite && !utilsnet.IsIPv6String(mgmtNetworkConfig.Spec.ServiceNetwork[0]) {
Member

What's the error that this would throw in a disconnected environment at the moment? It's not clear from reading the ticket.

Contributor Author

Basically, the HO pod gets blocked in the reconciliation and the log shows:

{"level":"error","ts":"2023-12-20T15:23:01Z","msg":"Reconciler error","controller":"deployment","controllerGroup":"apps","controllerKind":"Deployment","Deployment":{"name":"operator","namespace":"hypershift"},"namespace":"hypershift","name":"operator","reconcileID":"451fde3c-eb1b-4cf0-98cb-ad0f8c6a6288","error":"cannot get telemeter client secret: Secret \"telemeter-client\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

{"level":"debug","ts":"2023-12-20T15:23:01Z","logger":"events","msg":"Failed to ensure UWM telemetry remote write: cannot get telemeter client secret: Secret \"telemeter-client\" not found","type":"Warning","object":{"kind":"Deployment","namespace":"hypershift","name":"operator","uid":"c6628a3c-a597-4e32-875a-f5704da2bdbb","apiVersion":"apps/v1","resourceVersion":"4091099"},"reason":"ReconcileError"}

It's intrusive in the sense that it blocks the HC reconciliation, not just the normal functioning of the operator.

Member

Basically, the HO pod gets blocked in the reconciliation

Can you clarify why? The error you shared is from a different controller; it shouldn't block HC reconciliation.

The error you shared says it can't get a secret, how is that related to disconnected?

Because we are assuming that if the first entry of the serviceNetwork in the MGMT cluster is IPv6, the HostedClusters are disconnected

I don't think that's a fair assumption. If --enable-uwm-telemetry-remote-write is enabled and telemetry cannot be reached because the cluster is disconnected, we should just log it.

@sjenning
Contributor

/hold

I think we just want the user to explicitly tell us with the existing flag, not intuit the intent through the ServiceNetwork being IPv6.
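
As a rough sketch of what "explicitly tell us with the existing flag" means in practice, the example below gates the feature purely on a boolean flag. It uses the standard library `flag` package for brevity, which is an assumption (the operator's real CLI wiring may differ), and the `options` struct, `parseOptions` helper, help text, and default value of `false` are hypothetical. Only the flag name `--enable-uwm-telemetry-remote-write` comes from this thread.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the single setting relevant to this discussion.
// Hypothetical struct for this sketch.
type options struct {
	EnableUWMTelemetryRemoteWrite bool
}

// parseOptions reads the flag from the given arguments. The default of false
// is an assumption made for this example.
func parseOptions(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.BoolVar(&opts.EnableUWMTelemetryRemoteWrite, "enable-uwm-telemetry-remote-write", false,
		"enable the UWM telemetry remote write controller")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return opts, nil
}

func main() {
	opts, err := parseOptions([]string{"--enable-uwm-telemetry-remote-write=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The controller would be set up only when the user opted in;
	// no network configuration is inspected.
	fmt.Println("remote write enabled:", opts.EnableUWMTelemetryRemoteWrite)
}
```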

@openshift-ci bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Dec 20, 2023
@jparrill
Contributor Author

OK, makes sense. I will convert this PR into a more documentation-oriented one and bring the ACM/MCE people into the loop so that this also gets documented on their side.

@openshift-ci bot added the area/control-plane-operator (Indicates the PR includes changes for the control plane operator - in an OCP release) label Dec 20, 2023
docs/content/how-to/disconnected/known-issues.md (outdated review thread, resolved)
docs/content/how-to/disconnected/known-issues.md (outdated review thread, resolved)
…cted envs

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@sjenning
Contributor

/approve
/hold cancel

@openshift-ci bot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label Dec 20, 2023
@sjenning
Contributor

/lgtm

Contributor

openshift-ci bot commented Dec 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill, sjenning

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels Dec 20, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6c26740 and 2 for PR HEAD bd8c922 in total

Contributor

openshift-ci bot commented Dec 20, 2023

@jparrill: all tests passed!

@openshift-merge-bot openshift-merge-bot bot merged commit 146072b into openshift:main Dec 20, 2023
12 checks passed
@openshift-ci-robot

@jparrill: Jira Issue OCPBUGS-22399: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-22399 has been moved to the MODIFIED state.

@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-hypershift-container-v4.16.0-202312210011.p0.g146072b.assembly.stream for distgit hypershift.
All builds following this will include this PR.

@jparrill
Contributor Author

jparrill commented Jan 8, 2024

/cherry-pick release-4.15

@openshift-cherrypick-robot

@jparrill: new pull request created: #3380

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-01-09-085011

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.