add konnectivity proxy sidecar to ingress-operator to ensure it can properly perform in cluster canary healthchecks #1131

Contributor

@relyt0925 relyt0925 commented Mar 10, 2022

Currently the ingress operator fails to perform canary health checks in the guest cluster if it does not have direct network access to the ingress subdomain in the guest cluster. That access is not guaranteed, since the management cluster and guest cluster can run in a split network environment. This PR introduces a SOCKS proxy sidecar that allows the ingress operator to proxy these canary health check HTTPS requests through konnectivity and ultimately into the guest cluster. This allows the health checks to run properly in all environments and prevents Degraded status reports on the ingress resource, which can lead to customer concerns/tickets. Fixes: #1130

Relevant logs:

2022-03-10T03:40:37.559Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "error sending canary HTTP request: DNS error: Get \"https://canary-openshift-ingress-canary.example.com\": dial tcp: lookup canary-openshift-ingress-canary.example.com on 172.21.0.10:53: no such host"}


Tylers-MacBook-Pro:rhcos-vpcgen2-ignitiondata tylerlisowski$ kubectl --kubeconfig /tmp/prtesting49-65125390-admin-kubeconfig get clusteroperator | grep ingress
ingress                                    4.9.23    True        False         True       141m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

What this PR does / why we need it:
#1130
Which issue(s) this PR fixes:
Fixes #1130

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@relyt0925 relyt0925 force-pushed the ingress-operator-konnectivity-sidecar branch 4 times, most recently from f0c50c8 to 462b110 Compare March 10, 2022 04:47
@relyt0925
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@relyt0925 relyt0925 force-pushed the ingress-operator-konnectivity-sidecar branch 3 times, most recently from 90951bb to 63b6e4e Compare March 10, 2022 05:33
@relyt0925
Contributor Author

After change:

Tylers-MacBook-Pro:armada-hypershift-operator tylerlisowski$ kubectl --kubeconfig /tmp/prtesting49-65125390-admin-kubeconfig get clusteroperator | grep ingress
ingress                                    4.9.23    True        False         False      3h32m   


@relyt0925
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@enxebre
Member

enxebre commented Mar 10, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2022
@@ -118,6 +121,18 @@ func ReconcileDeployment(dep *appsv1.Deployment, params Params, apiPort *int32)
{Name: "IMAGE", Value: params.HAProxyRouterImage},
{Name: "CANARY_IMAGE", Value: params.IngressOperatorImage},
{Name: "KUBECONFIG", Value: "/etc/kubernetes/kubeconfig"},
{
Name: "HTTP_PROXY",
Contributor

/hold

This breaks outbound connectivity to cloud provider APIs, which we need for public clouds.

I don't have a solution for this off-hand, but it is a problem we will have to solve for other components in the context of management cluster with proxy support as well.

Contributor

Actually it works eventually, probably after the guest cluster has nodes? But this completely negates the advantage of running it in the management cluster.

Contributor Author

Not sure how it negates it?

Contributor Author

You still get the benefit of it rolling out the resources ahead of time since it can talk to the kube-apis? Which is where I believe the real time benefit is: this just solves a fundamental gap that needs to be solved for health checking on the domain route.

Contributor Author

@alvaroaleman brought up a great point that this does call cloud provider APIs, so those routes should not be proxied. This is added in the latest version.
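
As a rough sketch of what that amounts to (not the actual diff), the container ends up with three proxy variables: the two proxy URLs plus a NO_PROXY list for the cloud provider domains and the kube-apiserver. `EnvVar` here is a local stand-in for the Kubernetes `corev1.EnvVar` type so the example compiles on its own, and `proxyEnv` is a made-up helper name; the values match the `env` output posted later in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvVar is a local stand-in for corev1.EnvVar (assumption: only Name and
// Value are needed for this illustration).
type EnvVar struct {
	Name  string
	Value string
}

// proxyEnv is a hypothetical helper that builds the proxy-related env vars:
// everything goes through proxyURL except hosts matching the noProxy entries.
func proxyEnv(proxyURL string, noProxy []string) []EnvVar {
	return []EnvVar{
		{Name: "HTTP_PROXY", Value: proxyURL},
		{Name: "HTTPS_PROXY", Value: proxyURL},
		{Name: "NO_PROXY", Value: strings.Join(noProxy, ",")},
	}
}

func main() {
	env := proxyEnv("socks5://127.0.0.1:8090", []string{
		".amazonaws.com",
		".microsoftonline.com",
		".azure.com",
		"kube-apiserver",
	})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```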

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@relyt0925 relyt0925 force-pushed the ingress-operator-konnectivity-sidecar branch from 63b6e4e to 950d5b5 Compare March 10, 2022 19:28
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2022
@relyt0925 relyt0925 force-pushed the ingress-operator-konnectivity-sidecar branch from 950d5b5 to 5406963 Compare March 10, 2022 19:36
@alvaroaleman
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 10, 2022
@relyt0925
Contributor Author

/hold
one final round of tests

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@relyt0925 relyt0925 force-pushed the ingress-operator-konnectivity-sidecar branch from 5406963 to 28546dc Compare March 10, 2022 21:26
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2022
@relyt0925
Contributor Author

Adjusted NO_PROXY to the leading-dot format (e.g. .amazonaws.com), which is compatible with both curl and Go.

bash-4.4$ env | grep PROXY
HTTP_PROXY=socks5://127.0.0.1:8090
NO_PROXY=.amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver
HTTPS_PROXY=socks5://127.0.0.1:8090
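
For the Go side, a small sketch using the standard library's `http.ProxyFromEnvironment` should show the same split: hosts under the NO_PROXY suffixes go direct, everything else goes to the SOCKS5 proxy. The helper name `proxyFor` is made up for illustration, and the env values are the ones printed above. `ProxyFromEnvironment` reads the environment only once per process, so the variables are set in `init` before the first call.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func init() {
	// Mirror the env shown above. This must happen before the first call to
	// http.ProxyFromEnvironment, which caches the environment.
	os.Setenv("HTTP_PROXY", "socks5://127.0.0.1:8090")
	os.Setenv("HTTPS_PROXY", "socks5://127.0.0.1:8090")
	os.Setenv("NO_PROXY", ".amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver")
}

// proxyFor is a hypothetical helper: it reports which proxy net/http would
// use for rawURL, or "direct" when the request bypasses the proxy.
func proxyFor(rawURL string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return "", err
	}
	proxyURL, err := http.ProxyFromEnvironment(req)
	if err != nil {
		return "", err
	}
	if proxyURL == nil {
		return "direct", nil
	}
	return proxyURL.String(), nil
}

func main() {
	for _, u := range []string{
		"https://dynamodb.us-west-2.amazonaws.com",
		"https://management.azure.com",
		"https://google.com",
	} {
		via, err := proxyFor(u)
		if err != nil {
			fmt.Println(u, "error:", err)
			continue
		}
		fmt.Println(u, "->", via)
	}
}
```

No network access is needed to run this; it only asks net/http which route it would pick.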

bash-4.4$ curl -v https://dynamodb.us-west-2.amazonaws.com
* Rebuilt URL to: https://dynamodb.us-west-2.amazonaws.com/
* Uses proxy env variable NO_PROXY == '.amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver'
*   Trying 52.94.28.162...
* TCP_NODELAY set
* Connected to dynamodb.us-west-2.amazonaws.com (52.94.28.162) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=dynamodb.us-west-2.amazonaws.com
*  start date: Jun 29 00:00:00 2021 GMT
*  expire date: Jun  6 23:59:59 2022 GMT
*  subjectAltName: host "dynamodb.us-west-2.amazonaws.com" matched cert's "dynamodb.us-west-2.amazonaws.com"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon

You can tell it's going direct since it connects directly to the backend IP rather than to the proxy at 127.0.0.1:8090.

bash-4.4$ curl -v https://login.microsoftonline.com
* Rebuilt URL to: https://login.microsoftonline.com/
* Uses proxy env variable NO_PROXY == '.amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver'
*   Trying 40.126.7.32...
* TCP_NODELAY set
* Connected to login.microsoftonline.com (40.126.7.32) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation; CN=stamp2.login.microsoftonline.com
*  start date: Mar  4 00:00:00 2022 GMT
*  expire date: Mar  4 23:59:59 2023 GMT
*  subjectAltName: host "login.microsoftonline.com" matched cert's "login.microsoftonline.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
> GET / HTTP/1.1
> Host: login.microsoftonline.com
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Cache-Control: no-store, no-cache
< Pragma: no-cache
< Content-Type: text/html; charset=utf-8
< Expires: -1
< Location: https://www.office.com/login#
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< X-Content-Type-Options: nosniff
< P3P: CP="DSP CUR OTPi IND OTRi ONL FIN"
< x-ms-request-id: 2d0fe0f3-bb3a-4547-b6c8-328fc0000700
< x-ms-ests-server: 2.1.12559.4 - NCUS ProdSlices
< Set-Cookie: fpc=AnRNubvk6PFAsl2QiMrTLGo; expires=Sat, 09-Apr-2022 21:29:05 GMT; path=/; secure; HttpOnly; SameSite=None
< Set-Cookie: esctx=AQABAAAAAAD--DLA3VO7QrddgJg7WevrsJJ-jRbEqfN8xiXScTWx1NTBKiNnwZrtqBR6v26cJikXLv5HQVJM03hfpuPTwr9uSvyhAAuSohTumuL35_QgfP0LcNjGenOVcpYsUUrBeKH8FuofJOEWfSf1s-4O8qtWoalDCeB5tg0-SNJUGtG9mMaEFBUbZj_f5LLm6HoewCsgAA; domain=.login.microsoftonline.com; path=/; secure; HttpOnly; SameSite=None
< Set-Cookie: x-ms-gateway-slice=estsfd; path=/; secure; samesite=none; httponly
< Set-Cookie: stsservicecookie=estsfd; path=/; secure; samesite=none; httponly
< Date: Thu, 10 Mar 2022 21:29:04 GMT
< Content-Length: 146
< 
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://www.office.com/login#">here</a>.</h2>
</body></html>
* Connection #0 to host login.microsoftonline.com left intact
bash-4.4$ 

bash-4.4$ curl -v https://management.azure.com                
* Rebuilt URL to: https://management.azure.com/
* Uses proxy env variable NO_PROXY == '.amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver'
*   Trying 13.73.240.225...
* TCP_NODELAY set
* Connected to management.azure.com (13.73.240.225) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=WA; L=Redmond; O=Microsoft Corporation; CN=management.azure.com
*  start date: Feb  9 20:36:25 2022 GMT
*  expire date: Feb  4 20:36:25 2023 GMT
*  subjectAltName: host "management.azure.com" matched cert's "management.azure.com"
*  issuer: C=US; O=Microsoft Corporation; CN=Microsoft Azure TLS Issuing CA 05
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55eb1d7eb730)
> GET / HTTP/2
> Host: management.azure.com
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 400 
< cache-control: no-cache
< pragma: no-cache
< content-type: application/json; charset=utf-8
< expires: -1
< x-ms-failure-cause: gateway
< x-ms-request-id: df65fbc5-06a4-4528-be27-eccb4a749024
< x-ms-correlation-request-id: df65fbc5-06a4-4528-be27-eccb4a749024
< x-ms-routing-request-id: SOUTHCENTRALUS:20220310T213010Z:df65fbc5-06a4-4528-be27-eccb4a749024
< strict-transport-security: max-age=31536000; includeSubDomains
< x-content-type-options: nosniff
< date: Thu, 10 Mar 2022 21:30:10 GMT
< content-length: 137
< 
* Connection #0 to host management.azure.com left intact
{"error":{"code":"MissingApiVersionParameter","message":"The api-version query parameter (?api-version=) is required for all requests."}}
bash-4.4$ 

@relyt0925
Contributor Author

Whereas for a host that is proxied:

curl -v https://google.com.
* Unwillingly accepted illegal URL using 3 slashes!
* Rebuilt URL to: https://google.com/
* Uses proxy env variable NO_PROXY == '.amazonaws.com,.microsoftonline.com,.azure.com,kube-apiserver'
* Uses proxy env variable HTTPS_PROXY == 'socks5://127.0.0.1:8090'
*   Trying 127.0.0.1...
* TCP_NODELAY set
* SOCKS5 communication to google.com:443
* SOCKS5 connect to IPv4 142.251.32.206 (locally resolved)
* SOCKS5 request granted.
* Connected to 127.0.0.1 (127.0.0.1) port 8090 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.google.com
*  start date: Feb 17 10:22:00 2022 GMT
*  expire date: May 12 10:21:59 2022 GMT
*  subjectAltName: host "google.com" matched cert's "google.com"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1C3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use

@relyt0925
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, relyt0925


@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2022

@relyt0925: all tests passed!


@openshift-merge-robot openshift-merge-robot merged commit 9fd6e1d into openshift:main Mar 10, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.