OCPBUGS-26762: Disable DNS resolving for CNO #3986

Merged on May 9, 2024 (2 commits)

Contributor

@kyrtapz kyrtapz commented May 6, 2024

CNO uses the konnectivity-socks5-proxy to perform cluster-wide-proxy readiness checks through the hosted cluster's network.
The cluster-wide-proxy address it uses should not be resolved to avoid double-proxy issues.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 6, 2024
@openshift-ci-robot

@kyrtapz: This pull request references Jira Issue OCPBUGS-26762, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 6, 2024
@openshift-ci openshift-ci bot requested review from csrwng and enxebre May 6, 2024 10:30
@openshift-ci openshift-ci bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 6, 2024
@kyrtapz
Contributor Author

kyrtapz commented May 6, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 6, 2024
@openshift-ci-robot

@kyrtapz: This pull request references Jira Issue OCPBUGS-26762, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter


@kyrtapz
Contributor Author

kyrtapz commented May 6, 2024

/retest

@stevekuznetsov
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2024
@kyrtapz
Contributor Author

kyrtapz commented May 6, 2024

/cc @sjenning

@openshift-ci openshift-ci bot requested a review from sjenning May 6, 2024 17:44
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 7, 2024
@@ -265,7 +269,7 @@ type proxyResolver struct {

 func (d proxyResolver) Resolve(ctx context.Context, name string) (context.Context, net.IP, error) {
 	// Preserve the host so we can recognize it
-	if isCloudAPI(name) {
+	if isCloudAPI(name) || d.disableResolver {
Member

@enxebre enxebre May 8, 2024


This won't perform DNS resolution, but dialing would still go through the guest cluster, preserving the issue described in the Jira. What am I missing?

To overcome the issue you could use resolve-from-management-cluster-dns, causing dialing to happen through the management cluster, but wouldn't that defeat the purpose of the health check?
Has the CNO changed its health check approach, e.g. to run a pod on the guest cluster or similar?

Member


Also, is there a way we can include health checking as part of our TestCreateClusterProxy validations?

Contributor Author


> This won't perform DNS resolution, but dialing would still go through the guest cluster, preserving the issue described in the Jira. What am I missing?

That is intentional. The readiness check has to be performed from the guest cluster, but we want to dial the exact cluster-wide-proxy address.
I know the issue is confusing, so let's consider what was happening before:

  1. CNO dials the cluster-wide-proxy FQDN using the local socks5 proxy.
  2. The socks5 proxy resolves the FQDN to an IP address and sends that IP to the guest cluster through konnectivity.
  3. The konnectivity agent in the guest cluster has the cluster-wide-proxy FQDN in no_proxy, but not the cluster-wide-proxy IP. It therefore tries to dial the cluster-wide-proxy IP through the configured proxy, and the proxy server rejects the connection.
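
Step 3 can be illustrated with a minimal Go sketch of no_proxy matching. This is a simplified, hand-rolled matcher written for illustration only, not the code the konnectivity agent actually uses; the function name `bypassesProxy`, the host names, and the IP address are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// bypassesProxy reports whether host matches an entry in a comma-separated
// no_proxy list. Simplified assumption: an entry matches the host exactly or
// as a domain suffix. Real implementations also handle ports, CIDRs and "*".
func bypassesProxy(noProxy, host string) bool {
	host = strings.ToLower(strings.TrimSpace(host))
	for _, entry := range strings.Split(noProxy, ",") {
		entry = strings.TrimPrefix(strings.ToLower(strings.TrimSpace(entry)), ".")
		if entry == "" {
			continue
		}
		if host == entry || strings.HasSuffix(host, "."+entry) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical no_proxy value: it lists the proxy's FQDN but not its IP.
	noProxy := "proxy.example.com,.cluster.local"

	// The FQDN is listed, so the agent would dial it directly.
	fmt.Println(bypassesProxy(noProxy, "proxy.example.com")) // true

	// The resolved IP is not listed, so the dial would go through the proxy.
	fmt.Println(bypassesProxy(noProxy, "192.0.2.10")) // false
}
```

Once the FQDN has been replaced by an IP, the no_proxy entry no longer applies, which is how the connection to the proxy ends up being sent through the proxy itself.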

With this change the behavior changes to:

  1. CNO dials the cluster-wide-proxy FQDN using the local socks5 proxy.
  2. The socks5 proxy doesn't resolve the FQDN and sends it unchanged to the guest cluster through konnectivity.
  3. The konnectivity agent in the guest cluster has the cluster-wide-proxy FQDN in no_proxy, so it dials the address directly.
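
The before and after behavior can be sketched in Go as follows. This is a rough illustration built around the condition from the diff, not the actual implementation: `lookupFunc`, `fakeLookup`, `forwardedHost`, the body of `isCloudAPI`, and the example host names are hypothetical, and the sketch assumes that returning a nil IP from `Resolve` leaves the original name in place for the dialer to forward.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"strings"
)

// isCloudAPI is a hypothetical stand-in for the helper named in the diff.
func isCloudAPI(name string) bool {
	return strings.HasSuffix(name, ".cloud.example.com")
}

// lookupFunc is injected so the sketch runs without real DNS (assumption).
type lookupFunc func(ctx context.Context, host string) ([]net.IP, error)

type proxyResolver struct {
	disableResolver bool
	lookup          lookupFunc
}

// Resolve skips DNS resolution for cloud API hosts or when the resolver is
// disabled, returning a nil IP so that the name is preserved.
func (d proxyResolver) Resolve(ctx context.Context, name string) (context.Context, net.IP, error) {
	if isCloudAPI(name) || d.disableResolver {
		return ctx, nil, nil
	}
	ips, err := d.lookup(ctx, name)
	if err != nil {
		return ctx, nil, err
	}
	if len(ips) == 0 {
		return ctx, nil, errors.New("no addresses found for " + name)
	}
	return ctx, ips[0], nil
}

// forwardedHost returns the host that would be sent on to the guest cluster:
// the resolved IP if there is one, otherwise the unchanged name.
func forwardedHost(r proxyResolver, name string) (string, error) {
	_, ip, err := r.Resolve(context.Background(), name)
	if err != nil {
		return "", err
	}
	if ip != nil {
		return ip.String(), nil
	}
	return name, nil
}

// fakeLookup always answers with a documentation-range address.
func fakeLookup(_ context.Context, _ string) ([]net.IP, error) {
	return []net.IP{net.ParseIP("192.0.2.10")}, nil
}

func main() {
	for _, disabled := range []bool{false, true} {
		r := proxyResolver{disableResolver: disabled, lookup: fakeLookup}
		host, err := forwardedHost(r, "proxy.example.com")
		if err != nil {
			fmt.Println("resolve error:", err)
			continue
		}
		fmt.Printf("disableResolver=%v forwards %s\n", disabled, host)
	}
}
```

With the resolver disabled, the guest cluster receives the same FQDN that appears in no_proxy, so the agent can recognize it and dial directly.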

@enxebre
Member

enxebre commented May 8, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2024
@sjenning
Contributor

sjenning commented May 8, 2024

/approve

Contributor

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024
@kyrtapz
Contributor Author

kyrtapz commented May 8, 2024

/retest

…socks5-proxy

Signed-off-by: Patryk Diak <pdiak@redhat.com>
CNO uses the konnectivity proxy to perform cluster wide proxy
readiness checks. It should always connect to the exact address
and not to the IP to avoid double proxy issues.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 8, 2024
@sjenning
Contributor

sjenning commented May 8, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 30fc457 and 2 for PR HEAD a41d9a7 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 37b99f5 and 1 for PR HEAD a41d9a7 in total

@sjenning
Contributor

sjenning commented May 8, 2024

/retest-required

Contributor

openshift-ci bot commented May 9, 2024

@kyrtapz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure a41d9a7 link false /test e2e-azure


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit c8af049 into openshift:main May 9, 2024
12 of 13 checks passed
@openshift-ci-robot

@kyrtapz: Jira Issue OCPBUGS-26762: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-26762 has been moved to the MODIFIED state.


@kyrtapz
Contributor Author

kyrtapz commented May 10, 2024

/cherry-pick release-4.15 release-4.14

@openshift-cherrypick-robot

@kyrtapz: new pull request created: #4015


@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-14-095225

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.