
[BUG] Requesting "additional-ca" and then requesting "network.harvesterhci.io.clusternetworks" will return a 503 error #2205

Closed
WuJun2016 opened this issue Apr 28, 2022 · 4 comments
Labels
area/backend, kind/bug, priority/1

Comments

@WuJun2016
Contributor

Describe the bug

[screenshot: 503 error when requesting network.harvesterhci.io.clusternetworks]

To Reproduce
Steps to reproduce the behavior:

  1. Go to the Settings page.
  2. Edit the backup-target setting.
  3. Click the "here" link at the bottom (the additional-ca edit page opens in a new tab).
    [screenshots]
  4. Open https://URL/v1/harvester/network.harvesterhci.io.clusternetworks in another tab (at this point the API request succeeds).
  5. Go back to the additional-ca edit page opened earlier, enter any value, and click the Save button. Immediately refresh the https://URL/v1/harvester/network.harvesterhci.io.clusternetworks page; it does not respond for a long time.

Expected behavior

Support bundle

Environment:

  • Harvester ISO version:
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context
Add any other context about the problem here.

@WuJun2016 WuJun2016 added the kind/bug label Apr 28, 2022
@guangbochen guangbochen added the priority/1 and area/backend labels Apr 28, 2022
@guangbochen guangbochen added this to the v1.0.2 milestone Apr 28, 2022
@weihanglo weihanglo self-assigned this May 3, 2022
@weihanglo
Contributor

weihanglo commented May 3, 2022

I did some quick research. If we provide a valid CA certificate, it succeeds without any error; otherwise it behaves as WuJun2016 described.

I used the following commands to create a valid certificate:

# Generate a 2048-bit RSA private key for the CA.
openssl genrsa 2048 > ca-key.pem
# Issue a self-signed CA certificate from that key, valid for 365000 days.
openssl req -new -x509 -nodes -days 365000 -key ca-key.pem -out ca-cert.pem

The current backend code (harvester-webhook) only adds additional-ca to the root CA pool of its HTTPS transport. Nothing seems weird to me so far. However, I observed some interesting facts:

  1. Changing additional-ca causes the harvester pod to be terminated and restarted.
  2. The timing of the harvester pod restart seems to coincide with the websocket subscription.
  3. An invalid value of additional-ca causes the pod to restart immediately, so the user experiences a hang and a 503 error right after applying the change.

I wonder whether we can avoid the downtime by leveraging tls.Config.GetCertificate. I will do some experiments based on these findings.
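
For illustration, here is a minimal sketch of that idea: a reloadable certificate served through tls.Config.GetCertificate, so the server keeps running while its certificate changes. The certStore type and the whole program are hypothetical, not code from the Harvester repository.

package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"sync"
)

// certStore keeps the currently active serving certificate and can be
// refreshed at runtime, e.g. when the additional-ca setting changes.
type certStore struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Set parses and swaps in a new certificate without restarting anything.
func (s *certStore) Set(certPEM, keyPEM []byte) error {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.cert = &cert
	s.mu.Unlock()
	return nil
}

// GetCertificate matches the tls.Config.GetCertificate callback signature,
// so every new TLS handshake sees the latest certificate.
func (s *certStore) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cert, nil
}

func main() {
	store := &certStore{}
	// store.Set(...) must be called with a bootstrap certificate before serving.
	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{GetCertificate: store.GetCertificate},
	}
	// Empty file arguments are fine because GetCertificate supplies the cert.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}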

@weihanglo
Contributor

After a deeper investigation, I found that the root cause is that the harvester deployment is forced to restart when the additional CA cert changes.[1] The reason for the restart is that the podmutator in harvester-webhook injects and mounts a volume containing that CA certificate onto /etc/ssl/certs so that it is loaded as a system CA certificate.[2]

The CA cert injection code path is used by three workloads: harvester, rancher, and longhorn's backing-image-data-source. Only Harvester is under our control, so I am going to propose a possible solution, similar to how harvester-webhook patches its own root CA.[3]

  1. When the additional-ca setting changes, harvester runs syncAdditionalTrustedCAs to update backup-target and restarts itself. We can change this part to patch the HTTP transport instead (see the sketch after this list).
  2. Remove the current implementation, which injects a volume into the pod after a restart.
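
To make point 1 concrete, here is a minimal sketch of what patching the HTTP transport could look like, assuming we rebuild the client transport whenever the additional-ca setting changes. The function name newTransportWithAdditionalCA is hypothetical and not from the Harvester codebase.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// newTransportWithAdditionalCA returns an HTTP transport whose root CA pool is
// the system pool plus the PEM bundle from the additional-ca setting, so the
// process can trust the new CA without being restarted.
func newTransportWithAdditionalCA(additionalCAPEM []byte) *http.Transport {
	pool, err := x509.SystemCertPool()
	if err != nil {
		// Fall back to an empty pool if the system bundle cannot be read.
		pool = x509.NewCertPool()
	}
	pool.AppendCertsFromPEM(additionalCAPEM)

	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{RootCAs: pool}
	return transport
}

func main() {
	// additionalCAPEM would come from the additional-ca setting when it changes.
	var additionalCAPEM []byte
	client := &http.Client{Transport: newTransportWithAdditionalCA(additionalCAPEM)}
	_ = client
}

The backup-target client could then be recreated with this transport inside syncAdditionalTrustedCAs instead of restarting the deployment.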

As a side note, if a harvester cluster has 3 or more nodes, I guess it wouldn't have such downtime. This downtime issue might only occur in a non-HA cluster. I will verify my assumption later on.

Footnotes

  1. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/controller/master/setting/additional_ca.go#L12-L28

  2. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/webhook/resources/pod/mutator.go#L178-L183

  3. https://github.com/harvester/harvester/blob/e2985453c65193b407d47d53c3c96a249e2cf474/pkg/webhook/resources/setting/validator.go#L294-L303

@weihanglo
Contributor

As a side note, if a harvester cluster has 3 or more nodes, I guess it wouldn't have such downtime. This downtime issue might only occur in a non-HA cluster. I will verify my assumption later on.

Verified that in a 3-node cluster, high availability prevents requests from failing with a 503 error.

The current backend code (harvester-webhook) only adds additional-ca to the root CA pool of its HTTPS transport.

Just FYI, Consul and Nomad also use a similar trick, reloading/updating the tls.Config dynamically, much like harvester-webhook does.[1][2] (A rough sketch of the pattern follows the footnotes below.)

Footnotes

  1. https://github.com/hashicorp/consul/pull/5419

  2. https://github.com/hashicorp/nomad/pull/3479
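
For reference, a rough sketch of that dynamic-reload pattern, swapping the whole tls.Config on each new connection via GetConfigForClient. This is my own illustration of the general idea, assuming Go 1.19+ (for atomic.Pointer); it is not how Consul, Nomad, or harvester-webhook actually implement it.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"sync/atomic"
)

// dynamicTLS stores the active tls.Config behind an atomic pointer so it can
// be replaced while the listener keeps running.
type dynamicTLS struct {
	current atomic.Pointer[tls.Config] // requires Go 1.19+
}

// GetConfigForClient matches the tls.Config callback signature; each new
// connection picks up whatever config was stored most recently.
func (d *dynamicTLS) GetConfigForClient(*tls.ClientHelloInfo) (*tls.Config, error) {
	return d.current.Load(), nil
}

// Reload swaps in a new serving certificate and client CA pool, e.g. after
// the additional-ca setting changes.
func (d *dynamicTLS) Reload(cert tls.Certificate, caPEM []byte) {
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	d.current.Store(&tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
	})
}

func main() {
	d := &dynamicTLS{}
	// d.Reload(...) must be called once before serving.
	ln, err := tls.Listen("tcp", ":8443", &tls.Config{GetConfigForClient: d.GetConfigForClient})
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	// ln can now back an HTTP server; later Reload calls take effect on new connections.
}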

@rebeccazzzz rebeccazzzz modified the milestones: v1.0.2, v1.1.0 May 5, 2022
@guangbochen guangbochen self-assigned this Aug 12, 2022
@guangbochen guangbochen modified the milestones: v1.1.0, v1.2.0 Sep 6, 2022
@guangbochen guangbochen removed this from the v1.2.0 milestone Jan 5, 2023
@guangbochen
Contributor

Closing as this behavior is expected.
