[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN #4519
Comments
IMO we need an option during Harvester installation and configuration that allows specifying the FQDN in addition to the VIP for Harvester. The fleet-agent should then use the FQDN, and the SSL certificate should carry a SAN for that FQDN. |
There is such a secret: rancher/rancher/bin/build/charts/assets/fleet-agent/fleet-agent/templates/secret.yaml
When a (customized) additional-ca is used, this needs to be unified with another possibly related issue: #4511 |
Workaround: manually update the secret
|
Just for clarity, is the |
yeah, your FQDN |
The previous value was |
e.g. the
|
For me - having replaced the
|
@himslm01 how did you achieve the following against the secret?
please use |
I have to admit that I was extremely lazy and I used OpenLens to edit the field (which even does the BASE64 encoding for me).
|
Were those fields pre-processed by you (I guess so)? E.g., the you may try to use the command below to decode the base64 of
by contrast:
part of your listed data:
The |
@Martin-Weiss please also take a look at the workaround. Did we miss some steps? Thanks. |
Hi @w13915984028, You asked me to "use I didn't know how sensitive the token data was, so I stripped that too. |
Sorry - I've been out for the evening. Replacing the FQDN with the IP address in the Is there any correlation between the I guess a bodge could be for me to remove the LetsEncrypt certificate from the |
@w13915984028 - in my case I just did a kubectl edit on the fleet-agent-bootstrap secret and replaced the apiServerURL with the base64 encoded https://. @himslm01 - as long as your custom SSL certificate does not include the IP as a SAN, the error (doesn't contain any IP SANs) will not go away. So I guess you have to change this field in the secret to the base64 encoded URL. At least in my case this helped. If you run into the next error, "Failed to register agent", this seems to be another issue where, e.g., the token in the fleet-agent-bootstrap secret has become invalid. @w13915984028 - maybe setting the server-url in the Rancher settings might help as well, but I did not have a chance to test this. |
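The manual workaround described above can be sketched in Python as follows. This is only an illustration under assumptions: the hostname `harvester.my.domain` is a placeholder, the helper names are hypothetical, and the key name `apiServerURL` is taken from the comment above. Kubernetes stores Secret `data` values base64-encoded, which is why the URL has to be encoded before it is written with kubectl.

```python
import base64
import json


def encode_secret_value(value: str) -> str:
    """Base64-encode a string the way Kubernetes Secret `data` fields expect."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def build_patch(api_server_url: str) -> str:
    """Build a JSON merge patch that replaces only the apiServerURL field."""
    patch = {"data": {"apiServerURL": encode_secret_value(api_server_url)}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Placeholder FQDN (assumption): use the name covered by your certificate's SAN.
    print(build_patch("https://harvester.my.domain"))
```

The printed JSON could then be applied with something like `kubectl -n cattle-fleet-local-system patch secret fleet-agent-bootstrap --type merge -p '<json>'`, followed by deleting the fleet-agent pod so that it picks up the new value.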
A short summary of this issue and later actions/results: |
In my case, editing the fleet-agent-bootstrap secret to replace the apiServerURL with the base64 encoded https://FQDN (which was previously an IP address) and deleting the fleet-agent-XXXX pod only gets me from the fleet-agent logging the message:
to it logging the message:
I don't have any clue how to fix this state. |
I believe this is a fleet registration issue where the token in the secret might no longer be valid. If I remember right, a force update on the fleet-cluster could help, but I have no idea how to run that with the embedded Rancher in Harvester. Maybe someone else knows? |
Force update made no difference for me.
For me, I am able to change it within the UI by entering the DNS name; no need to base64-encode. If you use kubectl edit, you need base64.
On my setup, Harvester is harvester.my.domain and Rancher (addon) is rancher.my.domain. If I enter https://harvester.my.domain in the secret, it works after redeploying the fleet-agent pod.
Certain activities will retrigger this bootstrap scenario; I think redeploying the fleet-controller is one of them. Also, for me, with the Rancher vcluster enabled the namespace is different but the symptoms are the same: the IP is used instead of the DNS VIP that was entered during install.
The pod will also happily be 'running' with this error, meaning you would only go 'look for' and find this if you had a reason to suspect an issue due to fleet-agent not functioning. Should such a behaviour trigger a crash/restart or some other way to indicate unhealthy?
I am not saying it should work with a URL other than the VIP of Harvester (per install time), just sharing the observation in case this helps you, @himslm01. Try (or confirm) using the exact VIP entered within your 90_custom on the Harvester nodes themselves, not a DNS name which resolves to the same IP. If you still get unauthorized, no idea, and sorry for the bad idea :D |
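Since the pod keeps 'running' despite the error, one way to spot the condition is to decode the apiServerURL value from the secret and check whether it points at an IP literal rather than a DNS name. A minimal sketch, assuming the base64 value has already been read from the secret; both function names are hypothetical:

```python
import base64
import ipaddress
from urllib.parse import urlsplit


def decode_secret_value(encoded: str) -> str:
    """Decode a base64 Secret `data` value back into text."""
    return base64.b64decode(encoded).decode("utf-8")


def url_host_is_ip(url: str) -> bool:
    """Return True if the URL's host is an IP literal, False if it is a name."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # Stand-in for the value stored in the secret (encoded here for the demo).
    stored = base64.b64encode(b"https://192.168.0.34").decode("ascii")
    url = decode_secret_value(stored)
    print(url, "-> IP literal" if url_host_is_ip(url) else "-> DNS name")
```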
I was able to repro this, but here's what happened with the fixes. This also includes some proposed fixes from #4517 . This was an upgrade scenario with 2 nodes deployed with ipxe-examples and a custom DNS that I'm running in my homelab.
|
How the issue is reproduced and fixed via a workaround:
|
An ongoing fix: #4543 to solve this issue. |
@w13915984028 The change in #4543 works, but we still need to address the fleet apiServerURL issue. Here is the fleet-controller after the upgrade:
The fleet-agent functions well because it has the correct setting (the fleet-controller hasn't re-deployed it yet). We need to fix the fleet-controller part. |
All enhancements are summarized in EPIC #4553 |
Pre Ready-For-Testing Checklist
|
Automation e2e test issue: harvester/tests#951 |
@w13915984028 we need a test plan in the comment before moving to ready-for-testing (either the full plan or link). |
Tested with upgrade v1.1.2 -> v1.2.0 with 2 node bare metal servers. I was having some issues but they were related to the CA not being entered when adding all three parts in the ssl-certs settings. I added the CA afterwards to
Test Plan
|
Describe the bug
The Harvester upgrade gets stuck upgrading the Helm chart based deployments (e.g. rancher-monitoring-crd), and we see this error in the fleet-agent pod in cattle-fleet-local-system:
The fleet-agent-* pod has the following error log; it can't handle the sync of the related managedchart/bundles.
time="2023-09-12T10:17:20Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"
To Reproduce
Deploy harvester 1.1.2 and add a customer provided SSL certificate that has a single DNS SAN that points to the VIP of harvester.
Then upgrade to 1.2.0
Expected behavior
Upgrade works.
Environment
Additional context
Root cause seems to be that fleet-agent -> rancher communication happens over the setting settings.management.cattle.io -> server-url, which has https://<ip> instead of https://<fqdn>
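The x509 error follows from how certificate host verification treats IP addresses: when the client connects to an IP literal, only the certificate's IP SANs are considered, and its DNS SANs are ignored. The sketch below is a simplified Python model of that rule, not the actual verification code used by fleet-agent. The function name is hypothetical, and wildcard matching and other details are deliberately left out.

```python
import ipaddress


def san_covers_host(host, dns_sans, ip_sans):
    """Simplified model: IP hosts match only IP SANs, names match only DNS SANs."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: compare case-insensitively against DNS SANs only.
        return host.lower() in {name.lower() for name in dns_sans}
    # IP literal: DNS SANs are never consulted.
    return ip in {ipaddress.ip_address(s) for s in ip_sans}


if __name__ == "__main__":
    # Certificate with a single DNS SAN (placeholder name) and no IP SANs.
    dns_sans, ip_sans = ["harvester.my.domain"], []
    print(san_covers_host("192.168.0.34", dns_sans, ip_sans))
    print(san_covers_host("harvester.my.domain", dns_sans, ip_sans))
```

Under this model, a certificate with a single DNS SAN validates when the server-url uses the FQDN but not when it uses the VIP address, which matches the behaviour reported here.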