[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN #4519

Closed
Martin-Weiss opened this issue Sep 12, 2023 · 28 comments
Labels
area/upgrade kind/bug Issues that are defects reported by users or that we know have reached a real release priority/0 Must be fixed in this release reproduce/often Reproducible 10% to 99% of the time require/doc Improvements or additions to documentation severity/1 Function broken (a critical incident with very high impact)

Comments

@Martin-Weiss

Martin-Weiss commented Sep 12, 2023

Describe the bug

The Harvester upgrade gets stuck while upgrading the Helm-chart-based deployments (e.g. rancher-monitoring-crd), and we see this error in the fleet-agent pod in cattle-fleet-local-system:

  1. upgrade pod log, looping
Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
Sleep for 5 seconds to retry
  2. related resources
~ # kubectl get managedchart -A
NAMESPACE   NAME                   AGE
fleet-local  harvester                 439d
fleet-local  harvester-crd               439d
fleet-local  local-managed-system-upgrade-controller  439d
fleet-local  rancher-logging              318d
fleet-local  rancher-logging-crd            318d
fleet-local  rancher-monitoring            439d
fleet-local  rancher-monitoring-crd          439d

 # kubectl get bundle -A
NAMESPACE   NAME                     BUNDLEDEPLOYMENTS-READY  STATUS
fleet-local  fleet-agent-local               1/1            
fleet-local  local-managed-system-agent          1/1            
fleet-local  mcc-harvester                 1/1            
fleet-local  mcc-harvester-crd               1/1            
fleet-local  mcc-local-managed-system-upgrade-controller  1/1            
fleet-local  mcc-rancher-logging              0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-logging-crd            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring            0/1            OutOfSync(1) [Cluster fleet-local/local]
fleet-local  mcc-rancher-monitoring-crd          0/1            WaitApplied(1) [Cluster fleet-local/local]
  3. The reason is that the fleet-agent-* pod logs the following error, so it cannot sync the related managedcharts/bundles.

time="2023-09-12T10:17:20Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.34/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs"
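
For anyone hitting the same log line, one way to confirm the diagnosis is to inspect the SAN list of the certificate served on the VIP. The following is a minimal sketch and not part of the original report: the helper name san_text_has_ip is made up for illustration, and the commented usage assumes OpenSSL 1.1.1 or newer (for the -ext option) and that the VIP serves TLS on port 443.

```shell
# Hypothetical helper (not from the issue): succeeds if the SAN text printed by
#   openssl x509 -noout -ext subjectAltName
# contains an "IP Address:<ip>" entry for the given IP.
san_text_has_ip() {
  san_text=$1
  ip=$2
  # Split the comma-separated SAN entries onto separate lines, trim
  # surrounding whitespace, then look for an exact match.
  printf '%s\n' "$san_text" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "IP Address:$ip"
}

# Example against a live endpoint (192.168.0.34 is the VIP from the log above;
# port 443 is an assumption):
#   san=$(openssl s_client -connect 192.168.0.34:443 </dev/null 2>/dev/null \
#           | openssl x509 -noout -ext subjectAltName)
#   san_text_has_ip "$san" 192.168.0.34 || echo "certificate has no IP SAN for the VIP"
```

A certificate with only DNS SANs makes the check fail, which matches the x509 error in the fleet-agent log.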

To Reproduce
Deploy Harvester 1.1.2 and add a customer-provided SSL certificate whose only SAN is a DNS name that points to the Harvester VIP.
Then upgrade to 1.2.0.

Expected behavior
Upgrade works.

Environment

  • Harvester ISO version: 1.1.2 -> 1.2.0
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): not relevant

Additional context
The root cause seems to be that fleet-agent -> Rancher communication uses the server-url setting (settings.management.cattle.io), which contains https://<ip> instead of https://<fqdn>.
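
To see which form the server-url currently has, something like the sketch below could be used. The function name url_host_is_ipv4 is hypothetical, and only IPv4 literals are recognized; the kubectl call in the comment reads the setting's top-level value field.

```shell
# Hypothetical helper: succeeds if the host part of a URL is an IPv4 literal.
url_host_is_ipv4() {
  host=${1#*://}     # drop the scheme, if any
  host=${host%%/*}   # drop any path
  host=${host%%:*}   # drop any port
  printf '%s\n' "$host" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Example usage on a cluster:
#   url=$(kubectl get settings.management.cattle.io server-url -o jsonpath='{.value}')
#   url_host_is_ipv4 "$url" && echo "server-url is IP-based: $url"
```

If the check succeeds and the custom certificate carries only DNS SANs, the fleet-agent registration is likely to fail as described in this issue.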

@Martin-Weiss Martin-Weiss added kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Sep 12, 2023
@Martin-Weiss
Author

IMO we need an option during Harvester installation and configuration to specify the FQDN in addition to the VIP. The fleet-agent should then use the FQDN, and the SSL certificate should have a SAN for that FQDN.

@w13915984028
Member

w13915984028 commented Sep 12, 2023

There is a secret, fleet-agent-bootstrap, that is used during the upgrade, and its apiServerURL is an IP, not an FQDN.

rancher/rancher/bin/build/charts/assets/fleet-agent/fleet-agent/templates/secret.yaml

apiVersion: v1
data:
  systemRegistrationNamespace: "{{b64enc .Values.systemRegistrationNamespace}}"
  clusterNamespace: "{{b64enc .Values.clusterNamespace}}"
  token: "{{b64enc .Values.token}}"
  apiServerURL: "{{b64enc .Values.apiServerURL}}"
  apiServerCA: "{{b64enc .Values.apiServerCA}}"
kind: Secret
metadata:
  name: fleet-agent-bootstrap

When a (customized) additional-ca is used, the FQDN needs to be used consistently instead of the IP.

Another possibly related issue: #4511

@w13915984028
Member

w13915984028 commented Sep 12, 2023

Workaround: manually update the secret fleet-agent-bootstrap and replace the fleet-agent-* pod.

  1. Get the base64 encoding of the final apiServerURL, e.g. https://harvester.example.com (echo -n avoids encoding a trailing newline)
echo -n "https://harvester.example.com" | base64
aHR0cHM6Ly9oYXJ2ZXN0ZXIuZXhhbXBsZS5jb20=
  2. Edit the apiServerURL of the secret fleet-agent-bootstrap, using the base64 value above

kubectl edit secret -n cattle-fleet-local-system fleet-agent-bootstrap

  3. Replace the fleet-agent pod

kubectl delete pod ...
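
One pitfall when producing the base64 value is the trailing newline that a plain echo appends. A small sketch of a newline-free encoding follows; encode_api_server_url is an illustrative name, not something provided by Harvester or Fleet. The commented alternative relies on the standard Kubernetes stringData field of Secrets, which lets the API server do the encoding.

```shell
# Hypothetical helper: base64-encode a URL without a trailing newline.
encode_api_server_url() {
  # printf '%s' emits the argument as-is; tr removes any line wrapping
  # that base64 may add for long inputs.
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example usage:
#   encode_api_server_url "https://harvester.example.com"
#
# Alternative that skips manual encoding (plain-text value via stringData):
#   kubectl patch secret -n cattle-fleet-local-system fleet-agent-bootstrap \
#     --type merge -p '{"stringData":{"apiServerURL":"https://harvester.example.com"}}'
```

The namespace cattle-fleet-local-system is the one shown in the fleet-agent error log in this thread; other setups may use a different one.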

@himslm01

Just for clarity, is the a.b.c.d in the command to be replaced with the fully qualified hostname, because a.b.c.d looks a bit like an IPv4 IP address?

@w13915984028
Member

yeah, your FQDN

@himslm01

The previous value was https://10.64.0.19 - so should the echo command actually be echo "https://fqdn" | base64 ?

@w13915984028
Member

e.g. if the apiServerURL is https://harvester.example.com:

echo -n "https://harvester.example.com" | base64
aHR0cHM6Ly9oYXJ2ZXN0ZXIuZXhhbXBsZS5jb20=

@himslm01

himslm01 commented Sep 12, 2023

For me - having replaced the apiServerURL with the FQDN and deleting the fleet-agent pod, the new fleet-agent pod is logging this:

time="2023-09-12T14:05:48Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Unauthorized"

@w13915984028
Member

w13915984028 commented Sep 12, 2023

@himslm01 how did you make the following change to the secret? kubectl edit secret ... or kubectl delete ... + kubectl apply/create?

For me - having replaced the apiServerURL with the FQDN

Please run kubectl get secret -n cattle-fleet-local-system fleet-agent-bootstrap -oyaml and strip the base64 content; we will then have a look at the other fields of this object.

@himslm01

I have to admit that I was extremely lazy and I used OpenLens to edit the field (which even does the BASE64 encoding for me).

$ kubectl get secret --context harvester003 -n cattle-fleet-local-system fleet-agent-bootstrap -oyaml
apiVersion: v1
data:
  apiServerCA: ""
  apiServerURL: aH_BASE64_aw==
  clusterNamespace: Zm_BASE64_bQ==
  token: ZX_BASE64_lE=
kind: Secret
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4_REDACTED_AA
    objectset.rio.cattle.io/id: fleet-agent-bootstrap-cattle-fleet-local-system
  creationTimestamp: "2023-09-11T20:48:21Z"
  labels:
    objectset.rio.cattle.io/hash: 362023f752e7f1989d8b652e029bd2c658ae7c44
  name: fleet-agent-bootstrap
  namespace: cattle-fleet-local-system
  resourceVersion: "910927838"
  uid: 792de82d-c498-45d4-939f-fb0743aeb434
type: Opaque

@w13915984028
Member

w13915984028 commented Sep 12, 2023

@himslm01

Were those fields pre-processed by you (I guess so)? E.g., the apiServerURL value aH_BASE64_aw== is not valid base64.

You may use the command below to decode the base64 values of apiServerURL, clusterNamespace and token from your local data (e.g. the kubectl output), to double-check that they decode correctly and that the values are as expected.

echo "aH_BASE64_aw==" | base64 -d
hbase64: invalid input

by contrast:

echo "aHR0cHM6Ly9oYXJ2ZXN0ZXIuZXhhbXBsZS5jb20K" | base64 -d
https://harvester.example.com

part of your listed data:

data:
  apiServerCA: ""
  apiServerURL: aH_BASE64_aw==
  clusterNamespace: Zm_BASE64_bQ==
  token: ZX_BASE64_lE=

The Unauthorized error may be due to the token field; it needs more investigation.
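
The decode check above can be wrapped in a small helper and applied to each field of the secret. This is only a sketch: is_valid_base64 is a made-up name, and it assumes a base64 implementation that rejects malformed input with a non-zero exit status, as the GNU coreutils one in the output above does.

```shell
# Hypothetical helper: succeeds if the argument decodes cleanly as base64.
is_valid_base64() {
  printf '%s' "$1" | base64 -d >/dev/null 2>&1
}

# Example usage against the secret discussed in this thread:
#   for f in apiServerURL clusterNamespace token; do
#     v=$(kubectl get secret -n cattle-fleet-local-system fleet-agent-bootstrap \
#           -o jsonpath="{.data.$f}")
#     is_valid_base64 "$v" || echo "$f is not valid base64"
#   done
```

A value that passes this check can still be wrong (for example a stale token), so it only rules out encoding mistakes.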

@w13915984028
Member

@Martin-Weiss please also help take a look at the workaround; did we miss any steps? Thanks.

@himslm01

himslm01 commented Sep 12, 2023

Hi @w13915984028,

You asked me to "use kubectl get secret -n cattle-fleet-local-system fleet-agent-bootstrap -oyaml and then strip those base64 content" so I did, replacing the base64 data with _BASE64_. Sorry I didn't make that 100% clear :-)

I didn't know how sensitive the token data was, so I stripped that too.

@himslm01

Sorry - I've been out for the evening.

Replacing the FQDN with the IP address in the fleet-agent-bootstrap secret and deleting the fleet-agent-XXXX pod causes the tls: failed to verify certificate: x509: cannot validate certificate for 192.168.0.34 because it doesn't contain any IP SANs error to return, so the issue is not the way I'm editing the secret.

Is there any correlation between the apiServerURL and the data in the token?

I guess a bodge could be for me to remove the LetsEncrypt certificate from the harvesterhci.io/Setting ssl-certificates during the upgrade, but that seems ugly.

@Martin-Weiss
Author

@w13915984028 - in my case I just did a kubectl edit on the fleet-agent-bootstrap secret and replaced the apiServerURL with the base64 encoded https://<fqdn>.

@himslm01 - as long as your custom SSL certificate does not include the IP as a SAN, the error (doesn't contain any IP SANs) will not go away. So I guess you have to change this field in the secret to the base64 encoded URL; at least in my case this helped. If you then run into the next "Failed to register agent" error, this seems to be another issue where e.g. the token in the fleet-agent-bootstrap secret became invalid.

@w13915984028 - maybe setting the server-url in the Rancher settings might help as well, but I did not have a chance to test this.

@w13915984028
Member

A short summary of this issue and later actions/results:

#4525 (comment)

@himslm01

himslm01 commented Sep 13, 2023

In my case, editing the fleet-agent-bootstrap secret to replace the apiServerURL with the base64 encoded https://FQDN (which was previously an IP address) and deleting the fleet-agent-XXXX pod only gets me from the fleet-agent logging the message:

Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://10.64.0.19/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 10.64.0.19 because it doesn't contain any IP SANs

to it logging the message:

Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Unauthorized

I don't have any clue how to fix this state.

@Martin-Weiss
Author


Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Unauthorized

I don't have any clue how to fix this state.

I believe this is a fleet registration issue where the token in the secret might no longer be valid. If I remember right, a force update on the fleet cluster could help, but I have no idea how to run that with the embedded Rancher in Harvester. Maybe someone else knows?

@slackspace-io

slackspace-io commented Sep 15, 2023

Force update had no change for me.

For me, I am able to change it in the UI by entering the DNS name; no need to base64 encode it. If you use kubectl edit, you need base64.

On my setup, harvester is harvester.my.domain and rancher(addon) is rancher.my.domain

If I enter https://harvester.my.domain in the secret it will work after redeploying the fleet-agent pod
If I enter https://rancher.my.domain then I get the unauthorized. Doing force update does not fix this
If I delete the secret, it says it cannot find the secret.

Certain activities will retrigger this bootstrap scenario - I think redeploying fleet-controller does. Also, for me with the Rancher vcluster enabled, the namespace is different but the symptoms are the same. The IP is used instead of the DNS VIP that was entered during install.

The pod will also happily be 'running' with this error, meaning you would only go looking for and find this if you had a reason to suspect that fleet-agent is not functioning. Should such behaviour trigger a crash/restart or indicate unhealthiness in some other way?

I am not saying it should work with a URL other than the VIP of Harvester (per install time), just sharing the observation in case this helps you @himslm01. Try (or confirm) using the exact VIP entered in the 90_custom on the Harvester nodes themselves, not a DNS name which resolves to the same IP. If you still get unauthorized, no idea and sorry for the bad idea :D

@noahgildersleeve

I was able to repro this, but here's what happened with the fixes. This also includes some proposed fixes from #4517 . This was an upgrade scenario with 2 nodes deployed with ipxe-examples and a custom DNS that I'm running in my homelab.

  • This was done with a public cert and IP and DNS SAN (DNS set to hostname and IP set to the VIP of Harvester). Cert was created with this command

    • openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes -keyout harvester.key -out harvester.crt -subj "/CN=harvester-master.test" -addext "subjectAltName=DNS:192.168.0.131,DNS:harvester.test,IP:192.168.0.131"
  • With base64 encoded https://harvester.test as the apiserver

    • time="2023-09-16T17:44:39Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://harvester.test/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": dial tcp: lookup harvester.test on 10.53.0.10:53: no such host"
  • with base64 encoded https://192.168.0.131 as the apiserver

    • time="2023-09-16T17:47:17Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.0.131/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
  • When trying to set the apiServerCA to the base64 string produced by kubectl get settings.management.cattle.io -n cattle-system internal-cacerts -o jsonpath="{.value}" | base64, I get an error when saving: error: secrets "fleet-agent-bootstrap" is invalid
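
Several distinct registration failures have now shown up in this thread (missing IP SAN, unknown CA, DNS lookup failure, Unauthorized). A rough sketch of telling them apart from a fleet-agent log line is given below; the function name and the category labels are invented for illustration, and the matching is a plain substring test on the messages quoted above.

```shell
# Hypothetical helper: map a fleet-agent log line to a coarse failure category.
classify_register_error() {
  case $1 in
    *"doesn't contain any IP SANs"*)             echo ip-san-missing ;;
    *"certificate signed by unknown authority"*) echo unknown-ca ;;
    *"no such host"*)                            echo dns-lookup-failed ;;
    *": Unauthorized"*)                          echo unauthorized ;;
    *)                                           echo other ;;
  esac
}

# Example usage (the pod name is a placeholder; the namespace is the one from
# this thread and may differ in other setups):
#   kubectl logs -n cattle-fleet-local-system <fleet-agent-pod> --tail=20 \
#     | while IFS= read -r line; do classify_register_error "$line"; done | sort | uniq -c
```

Each category points at a different fix: the apiServerURL or server-url for the SAN case, the apiServerCA for the unknown CA, cluster DNS for the lookup failure, and the registration token for Unauthorized.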

@bk201 bk201 added this to the v1.2.1 milestone Sep 20, 2023
@w13915984028
Member

How the issue is reproduced and fixed via a workaround:

reproduce steps:

(0) Set up a 2-node Harvester v1.1.2 cluster; the DNS name for the VIP is harv31.example.com

(1) create certificate for (*).example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

$ those 2 files are generated
example.crt  example.key

(2) Configure `ssl-certificates` via the Harvester UI: both `public certificate` and `ca` point to file `example.crt`, and `private key` points to `example.key`

(3) Add DNS record, e.g:  192.168.122.199	harv31.example.com

(4) Upgrade to v1.2.0; the issue is reproduced and the upgrade gets stuck in pod `hvst-upgrade-vq4hl-apply-manifests-*`

fleet-agent log:

time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"

upgrade pod status

# kubectl get pods -A | grep upgrade
cattle-system               system-upgrade-controller-5685d568ff-tkvxb                 1/1     Running     0              85m
harvester-system            hvst-upgrade-vq4hl-apply-manifests-65vv8                   1/1     Running     0              87m  // waiting for managedchart to be ready
..


bundle status:

# kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
..                    
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
harv31:/home/rancher # 


workaround:

(5) check server-url setting

 # kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://192.168.122.199

(6) check the secret `fleet-agent-bootstrap`; `data.apiServerCA` should be the base64 value of the previous `example.crt`. If not, update it

# kubectl get secret -A | grep fleet
cattle-fleet-local-system                fleet-agent-bootstrap                                                Opaque                                        5      83m

 # kubectl get secret -n cattle-fleet-local-system fleet-agent-bootstrap -oyaml
apiVersion: v1
data:
  apiServerCA: ...
  apiServerURL: aHR0cHM6Ly8xOTIuMTY4LjEyMi4xOTk=
  clusterNamespace: ZmxlZXQtbG9jYWw=
  systemRegistrationNamespace: Y2F0dGxlLWZsZWV0LWNsdXN0ZXJzLXN5c3RlbQ==
  token: ...
kind: Secret
metadata:
..
  name: fleet-agent-bootstrap
  namespace: cattle-fleet-local-system
  resourceVersion: "75235"
  uid: 975bd478-b357-4240-aeb4-eaff4626909f
type: Opaque

(7) update `server-url` to `https://harv31.example.com`
 # kubectl edit settings.management.cattle.io -n cattle-system server-url
setting.management.cattle.io/server-url edited
..
 # kubectl get settings.management.cattle.io -n cattle-system server-url
NAME         VALUE
server-url   https://harv31.example.com

(8) the `fleet-agent-*` pod is replaced by Rancher automatically and the upgrade continues
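
Step (7) can also be done non-interactively. The sketch below builds a JSON merge patch for the setting's top-level value field; server_url_patch is a hypothetical helper name, and the kubectl patch invocation in the comment is an untested alternative to the kubectl edit used above.

```shell
# Hypothetical helper: print a merge patch that sets server-url to the given URL.
# Rejects anything that is not an https:// URL.
server_url_patch() {
  case $1 in
    https://*) printf '{"value":"%s"}\n' "$1" ;;
    *) echo "expected an https:// URL, got: $1" >&2; return 1 ;;
  esac
}

# Example usage:
#   kubectl patch settings.management.cattle.io server-url --type merge \
#     -p "$(server_url_patch https://harv31.example.com)"
#   kubectl get settings.management.cattle.io server-url
```

The helper does no JSON escaping, which is acceptable for plain host names but not for arbitrary strings.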

@w13915984028
Member

An ongoing fix: #4543 to solve this issue.

@w13915984028 w13915984028 moved this from New to In progress in Community Issue Review Sep 20, 2023
@bk201
Member

bk201 commented Sep 20, 2023

@w13915984028 The change in #4543 works, but we still need to address the fleet apiServerURL issue. Here is the fleet-controller log after the upgrade:

time="2023-09-20T10:19:24Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2023-09-20T10:19:24Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:24Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"
time="2023-09-20T10:19:26Z" level=error msg="error syncing 'fleet-local/local': handler import-cluster: missing apiServerURL in fleet config for cluster auto registration, requeuing"

The fleet-agent functions well because it has the correct setting (the fleet-controller hasn't re-deployed it yet). We need to fix the fleet-controller part.

@w13915984028
Member

All enhancements are summarized in EPIC #4553

@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Sep 28, 2023

Pre Ready-For-Testing Checklist

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

  • Test plan:

(0) Set up an N-node (N>=2) Harvester v1.1.2 cluster; assume the DNS name for the VIP is harv31.example.com

(1) create certificate for (*).example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

$ those 2 files are generated
example.crt  example.key

(2) Configure `ssl-certificates` via the Harvester UI: both `public certificate` and `ca` point to file `example.crt`, and `private key` points to `example.key`

(3) Add DNS record, e.g:  192.168.122.199	harv31.example.com

(4.a) Upgrade to v1.2.0 without the fix: the issue is reproduced and the upgrade gets stuck in pod `hvst-upgrade-vq4hl-apply-manifests-*`
fleet-agent error log:
time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: 
Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": 
tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"

(4.b) Upgrade to v1.2.0 with the fix: the apply-manifests upgrade step succeeds and the upgrade continues. (If you encounter other issues, please refer to the known issues list and the related workarounds.)
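
When checking (4.a) versus (4.b), the bundle status is a convenient signal. A possible sketch for listing bundles that are not fully ready, based on the column layout of kubectl get bundle -A shown earlier in this issue, is given below; not_ready_bundles is an illustrative name and the parsing assumes the READY count stays in the third column.

```shell
# Hypothetical helper: read `kubectl get bundle -A` output on stdin and print
# namespace/name for every bundle whose READY column is not N/N.
not_ready_bundles() {
  awk 'NR > 1 && $3 ~ /^[0-9]+\/[0-9]+$/ {
         split($3, n, "/")
         if (n[1] != n[2]) print $1 "/" $2
       }'
}

# Example usage:
#   kubectl get bundle -A | not_ready_bundles
```

With the fix in place the list should eventually become empty; in the failing case it keeps showing entries such as fleet-local/mcc-rancher-monitoring-crd.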

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#951

@bk201 bk201 added require/doc Improvements or additions to documentation priority/0 Must be fixed in this release severity/1 Function broken (a critical incident with very high impact) reproduce/often Reproducible 10% to 99% of the time and removed reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Oct 4, 2023
@bk201
Member

bk201 commented Oct 11, 2023

@w13915984028 we need a test plan in the comment before moving to ready-for-testing (either the full plan or link).
And also a workaround doc in the known issue section: https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0

@noahgildersleeve

noahgildersleeve commented Oct 13, 2023

Tested the upgrade v1.1.2 -> v1.2.0 on 2-node bare-metal servers. I had some issues, but they were related to the CA not being entered when adding all three parts in the ssl-certs settings. I added the CA to additional-ca afterwards and it worked fine.

Test Plan

  1. Create DNS entry for cluster
  2. Deploy 2 node server with DNS set to the server that has the entry
  3. Add ssl-certs for public, private, and CA
  4. Add CA to additional-ca
  5. add server-url to underlying rancher via https://v112-upgrade.localnet/dashboard/c/local/settings/management.cattle.io.setting
  6. apply kubectl annotate setting.management server-url harvesterhci.io/patched-by-controller=true
  7. Create image, network, and VM before upgrade
  8. Run airgap upgrade to v1.2.0 on 2 node ipxe-examples setup

Projects
Community Issue Review
Resolved/Scheduled

8 participants