
[BUG] Requesting "additional-ca" and then requesting "network.harvesterhci.io.clusternetworks" will return a 503 error #2205

Closed
WuJun2016 opened this issue Apr 28, 2022 · 4 comments
Labels
area/backend, kind/bug, priority/1

Comments

@WuJun2016
Contributor

Describe the bug

[screenshot: 503 error when requesting network.harvesterhci.io.clusternetworks]

To Reproduce
Steps to reproduce the behavior:

  1. Go to the Settings page.
  2. Edit the backup-target setting.
  3. Click the "here" link at the bottom (the additional-ca edit page opens in a new tab).
    [screenshots]
  4. Open https://URL/v1/harvester/network.harvesterhci.io.clusternetworks in another tab (at this point the API request succeeds).
  5. Go back to the additional-ca edit page opened earlier, enter any value, and click the Save button. Immediately refresh the https://URL/v1/harvester/network.harvesterhci.io.clusternetworks page; it does not respond for a long time.

Expected behavior

Support bundle

Environment:

  • Harvester ISO version:
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context
Add any other context about the problem here.

@WuJun2016 WuJun2016 added the kind/bug label Apr 28, 2022
@guangbochen guangbochen added the priority/1 and area/backend labels Apr 28, 2022
@guangbochen guangbochen added this to the v1.0.2 milestone Apr 28, 2022
@weihanglo weihanglo self-assigned this May 3, 2022
@weihanglo
Contributor

weihanglo commented May 3, 2022

I did some quick research. If we provide a valid CA certificate, it succeeds without any error; otherwise it behaves as WuJun2016 described.

I used the following commands to create a valid certificate:

# Generate a 2048-bit RSA private key for the CA.
openssl genrsa 2048 > ca-key.pem
# Issue a self-signed CA certificate from that key, valid for 365000 days.
openssl req -new -x509 -nodes -days 365000 -key ca-key.pem -out ca-cert.pem

The current backend code (harvester-webhook) only adds additional-ca to the root CA pool of its HTTPS transport. Nothing seems weird to me so far. However, I observed some interesting facts:

  1. Changing additional-ca causes the harvester pod to be terminated and restarted.
  2. The timing of the harvester pod restart seems to coincide with the websocket subscription.
  3. An invalid value of additional-ca causes the pod to restart immediately, so the user experiences a hang and a 503 error right after applying the change.

I wonder whether we can avoid the downtime by leveraging tls.Config.GetCertificate. I will do some experiments based on these findings.
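
For illustration, here is a minimal sketch of that idea: a reloadable certificate served through tls.Config.GetCertificate, so the server keeps running while its certificate changes. The certStore type and the whole program are hypothetical, not code from the Harvester repository.

package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"sync"
)

// certStore keeps the currently active serving certificate and can be
// refreshed at runtime, e.g. when the additional-ca setting changes.
type certStore struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Set parses and swaps in a new certificate without restarting anything.
func (s *certStore) Set(certPEM, keyPEM []byte) error {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.cert = &cert
	s.mu.Unlock()
	return nil
}

// GetCertificate matches the tls.Config.GetCertificate callback signature,
// so every new TLS handshake sees the latest certificate.
func (s *certStore) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cert, nil
}

func main() {
	store := &certStore{}
	// store.Set(...) must be called with a bootstrap certificate before serving.
	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{GetCertificate: store.GetCertificate},
	}
	// Empty file arguments are fine because GetCertificate supplies the cert.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}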

@weihanglo
Contributor

After a deeper investigation, I found that the root cause is that the harvester deployment is forced to restart when the additional CA cert changes.[1] The reason for the restart is that the podmutator in harvester-webhook injects and mounts a volume containing that CA certificate onto /etc/ssl/certs so that it is loaded as a system CA certificate.[2]

The CA cert injection code path is used by three workloads: harvester, rancher, and longhorn's backing-image-data-source. Only Harvester is under our control, so I am going to propose a possible solution, similar to how harvester-webhook patches its own root CA.[3]

  1. When the additional-ca setting changes, harvester runs syncAdditionalTrustedCAs to update backup-target and restarts itself. We can change this part to patch the HTTP transport instead (see the sketch after this list).
  2. Remove the current implementation, which injects a volume into the pod after a restart.
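
To make point 1 concrete, here is a minimal sketch of what patching the HTTP transport could look like, assuming we rebuild the client transport whenever the additional-ca setting changes. The function name newTransportWithAdditionalCA is hypothetical and not from the Harvester codebase.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// newTransportWithAdditionalCA returns an HTTP transport whose root CA pool is
// the system pool plus the PEM bundle from the additional-ca setting, so the
// process can trust the new CA without being restarted.
func newTransportWithAdditionalCA(additionalCAPEM []byte) *http.Transport {
	pool, err := x509.SystemCertPool()
	if err != nil {
		// Fall back to an empty pool if the system bundle cannot be read.
		pool = x509.NewCertPool()
	}
	pool.AppendCertsFromPEM(additionalCAPEM)

	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{RootCAs: pool}
	return transport
}

func main() {
	// additionalCAPEM would come from the additional-ca setting when it changes.
	var additionalCAPEM []byte
	client := &http.Client{Transport: newTransportWithAdditionalCA(additionalCAPEM)}
	_ = client
}

The backup-target client could then be recreated with this transport inside syncAdditionalTrustedCAs instead of restarting the deployment.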

As a side note, if a harvester cluster has 3 or more nodes, I guess it wouldn't have such downtime. This downtime issue might only occur in a non-HA cluster. I will verify my assumption later on.

Footnotes

  1. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/controller/master/setting/additional_ca.go#L12-L28

  2. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/webhook/resources/pod/mutator.go#L178-L183

  3. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/webhook/resources/setting/validator.go#L294-L303

@weihanglo
Contributor

As a side note, if a harvester cluster has 3 or more nodes, I guess it wouldn't have such downtime. This downtime issue might only occur in a non-HA cluster. I will verify my assumption later on.

Verified that in a 3-node cluster, high availability prevents requests from failing with a 503 error.

The current backend code (harvester-webhook) only adds additional-ca to the root CA pool of its HTTPS transport.

Just FYI, Consul and Nomad also use a similar trick, reloading/updating the tls.Config dynamically, much like harvester-webhook does.[1][2] (A rough sketch of the pattern follows the footnotes below.)

Footnotes

  1. https://github.com/hashicorp/consul/pull/5419

  2. https://github.com/hashicorp/nomad/pull/3479
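
For reference, a rough sketch of that dynamic-reload pattern, swapping the whole tls.Config on each new connection via GetConfigForClient. This is my own illustration of the general idea, assuming Go 1.19+ (for atomic.Pointer); it is not how Consul, Nomad, or harvester-webhook actually implement it.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"sync/atomic"
)

// dynamicTLS stores the active tls.Config behind an atomic pointer so it can
// be replaced while the listener keeps running.
type dynamicTLS struct {
	current atomic.Pointer[tls.Config] // requires Go 1.19+
}

// GetConfigForClient matches the tls.Config callback signature; each new
// connection picks up whatever config was stored most recently.
func (d *dynamicTLS) GetConfigForClient(*tls.ClientHelloInfo) (*tls.Config, error) {
	return d.current.Load(), nil
}

// Reload swaps in a new serving certificate and client CA pool, e.g. after
// the additional-ca setting changes.
func (d *dynamicTLS) Reload(cert tls.Certificate, caPEM []byte) {
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	d.current.Store(&tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
	})
}

func main() {
	d := &dynamicTLS{}
	// d.Reload(...) must be called once before serving.
	ln, err := tls.Listen("tcp", ":8443", &tls.Config{GetConfigForClient: d.GetConfigForClient})
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	// ln can now back an HTTP server; later Reload calls take effect on new connections.
}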

@rebeccazzzz rebeccazzzz modified the milestones: v1.0.2, v1.1.0 May 5, 2022
@guangbochen guangbochen self-assigned this Aug 12, 2022
@guangbochen guangbochen modified the milestones: v1.1.0, v1.2.0 Sep 6, 2022
@guangbochen guangbochen removed this from the v1.2.0 milestone Jan 5, 2023
@guangbochen
Contributor

Closing as this behavior is expected.
