diff --git a/docs/upgrade/v1-1-2-to-v1-2-0.md b/docs/upgrade/v1-1-2-to-v1-2-0.md
index 5bc84422a81..14eb30493af 100644
--- a/docs/upgrade/v1-1-2-to-v1-2-0.md
+++ b/docs/upgrade/v1-1-2-to-v1-2-0.md
@@ -298,7 +298,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for
 1. Check if the `prometheus-rancher-monitoring-prometheus-0` pod is stuck with the status `Terminating`.
 
     ```
-    $ kubectl -n cattle-monitoring-system get pods 
+    $ kubectl -n cattle-monitoring-system get pods
     NAME                                         READY   STATUS        RESTARTS   AGE
     prometheus-rancher-monitoring-prometheus-0   0/3     Terminating   0          19d
     ```
@@ -399,7 +399,7 @@ If an upgrade is stuck in an `Upgrading System Service` state for an extended pe
 
 ---
 
-### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in airgapped environment
+### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in air-gapped environment
 
 Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` image in the ISO file as of v1.1.0. Windows VMs created before v1.1.0 used this image as a container disk, but kubelet may remove old images to free up disk space. In an air-gapped environment, such VMs can't pull this image again once it has been removed. You can fix this issue by changing the image to `registry.suse.com/suse/vmdp/vmdp:2.5.4.2` and restarting the Windows VMs.
 
@@ -407,7 +407,8 @@ Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` im
   - [[BUG] VMDP Image wrong after upgrade to Harvester 1.2.0](https://github.com/harvester/harvester/issues/4534)
 
 ---
-### 9. Upgrade stuck in the Post-draining state
+
+### 9. An upgrade is stuck in the Post-draining state
 
 The node might be stuck in the OS upgrade process if you encounter the **Post-draining** state, as shown below.
@@ -483,3 +484,111 @@ After performing the steps above, you should pass post-draining with the next re
   - [A potential bug in NewElementalPartitionsFromList which caused upgrade error code 33](https://github.com/rancher/elemental-toolkit/issues/1827)
 - Workaround:
   - https://github.com/harvester/harvester/issues/4526#issuecomment-1732853216
+
+---
+
+### 10. An upgrade is stuck in the Upgrading System Service state due to the `customer provided SSL certificate without IP SAN` error in `fleet-agent`
+
+If an upgrade is stuck in an **Upgrading System Service** state for an extended period, follow these steps to investigate the issue:
+
+1. Find the pods related to the upgrade:
+
+   ```
+   kubectl get pods -A | grep upgrade
+   ```
+
+   Example output:
+
+   ```
+   # kubectl get pods -A | grep upgrade
+   cattle-system      system-upgrade-controller-5685d568ff-tkvxb   1/1   Running   0   85m
+   harvester-system   hvst-upgrade-vq4hl-apply-manifests-65vv8     1/1   Running   0   87m   // waiting for managedchart to be ready
+   ..
+   ```
+
+2. The pod `hvst-upgrade-vq4hl-apply-manifests-65vv8` logs the following messages in a loop:
+
+   ```
+   Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
+   Sleep for 5 seconds to retry
+   ```
+
+3. Check the status of all bundles. Note that a couple of bundles are `OutOfSync`:
+
+   ```
+   # kubectl get bundle -A
+   NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
+   ...
+   fleet-local   mcc-local-managed-system-upgrade-controller   1/1
+   fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
+   ```
+
+4. 
The pod `fleet-agent-*` has the following error log:
+
+   ```
+   fleet-agent pod log:
+
+   time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"
+   ```
+
+5. Check the `ssl-certificates` setting in Harvester:
+
+   From the command line:
+
+   ```
+   # kubectl get settings.harvesterhci.io ssl-certificates
+   NAME               VALUE
+   ssl-certificates   {"publicCertificate":"-----BEGIN CERTIFICATE-----\nMIIFNDCCAxygAwIBAgIUS7DoHthR/IR30+H/P0pv6HlfOZUwDQYJKoZIhvcNAQEL\nBQAwFjEUMBIGA1UEAwwLZXhhbXBsZS5j...."}
+   ```
+
+   From the Harvester Web UI:
+
+   ![](/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png)
+
+6. Check the `server-url` setting; its value is the VIP:
+
+   ```
+   # kubectl get settings.management.cattle.io -n cattle-system server-url
+   NAME         VALUE
+   server-url   https://192.168.122.199
+   ```
+
+7. The root cause:
+
+   The user set a self-signed certificate in `ssl-certificates` that covers only the FQDN, but `server-url` points to the VIP, so the `fleet-agent` pod fails to register.
+
+   ```
+   For example: create a self-signed certificate for *.example.com
+
+   openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
+     -keyout example.key -out example.crt -subj "/CN=example.com" \
+     -addext "subjectAltName=DNS:example.com,DNS:*.example.com"
+
+   The outputs are: example.crt and example.key
+   ```
+
+8. The workaround:
+
+   Update `server-url` with the value `https://harv31.example.com` (an FQDN covered by the certificate):
+
+   ```
+   # kubectl edit settings.management.cattle.io -n cattle-system server-url
+   setting.management.cattle.io/server-url edited
+   ...
+
+   # kubectl get settings.management.cattle.io -n cattle-system server-url
+   NAME         VALUE
+   server-url   https://harv31.example.com
+   ```
+
+   After the workaround is applied, Rancher automatically replaces the `fleet-agent` pod, which registers successfully, and the upgrade continues.
+
+- Related issue:
+  - [[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN](https://github.com/harvester/harvester/issues/4519)
+- Workaround:
+  - https://github.com/harvester/harvester/issues/4519#issuecomment-1727132383
+
+---
diff --git a/static/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png b/static/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png
new file mode 100644
index 00000000000..184a1a6736d
Binary files /dev/null and b/static/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png differ
diff --git a/versioned_docs/version-v1.2/upgrade/v1-1-2-to-v1-2-0.md b/versioned_docs/version-v1.2/upgrade/v1-1-2-to-v1-2-0.md
index d9663124035..66c488f4dfc 100644
--- a/versioned_docs/version-v1.2/upgrade/v1-1-2-to-v1-2-0.md
+++ b/versioned_docs/version-v1.2/upgrade/v1-1-2-to-v1-2-0.md
@@ -298,7 +298,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for
 1. Check if the `prometheus-rancher-monitoring-prometheus-0` pod is stuck with the status `Terminating`.
 
     ```
-    $ kubectl -n cattle-monitoring-system get pods 
+    $ kubectl -n cattle-monitoring-system get pods
     NAME                                         READY   STATUS        RESTARTS   AGE
     prometheus-rancher-monitoring-prometheus-0   0/3     Terminating   0          19d
     ```
@@ -330,7 +330,7 @@ If you notice the upgrade is stuck in the **Upgrading System Service** state for
 
 ---
 
-### 7. Upgrade stuck in the `Upgrading System Service` state
+### 7. An upgrade is stuck in the `Upgrading System Service` state
 
 If an upgrade is stuck in an `Upgrading System Service` state for an extended period, some system services' certificates may have expired.
To investigate and resolve this issue, follow these steps:
@@ -399,7 +399,7 @@ If an upgrade is stuck in an `Upgrading System Service` state for an extended pe
 
 ---
 
-### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in airgapped environment
+### 8. The `registry.suse.com/harvester-beta/vmdp:latest` image is not available in air-gapped environment
 
 Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` image in the ISO file as of v1.1.0. Windows VMs created before v1.1.0 used this image as a container disk, but kubelet may remove old images to free up disk space. In an air-gapped environment, such VMs can't pull this image again once it has been removed. You can fix this issue by changing the image to `registry.suse.com/suse/vmdp/vmdp:2.5.4.2` and restarting the Windows VMs.
 
@@ -407,7 +407,8 @@ Harvester does not package the `registry.suse.com/harvester-beta/vmdp:latest` im
   - [[BUG] VMDP Image wrong after upgrade to Harvester 1.2.0](https://github.com/harvester/harvester/issues/4534)
 
 ---
-### 9. Upgrade stuck in the Post-draining state
+
+### 9. An upgrade is stuck in the Post-draining state
 
 The node might be stuck in the OS upgrade process if you encounter the **Post-draining** state, as shown below.
 
@@ -484,3 +485,110 @@ After performing the steps above, you should pass post-draining with the next re
 - Workaround:
   - https://github.com/harvester/harvester/issues/4526#issuecomment-1732853216
 
+---
+
+### 10. An upgrade is stuck in the Upgrading System Service state due to the `customer provided SSL certificate without IP SAN` error in `fleet-agent`
+
+If an upgrade is stuck in an **Upgrading System Service** state for an extended period, follow these steps to investigate the issue:
+
+1. 
Find the pods related to the upgrade:
+
+   ```
+   kubectl get pods -A | grep upgrade
+   ```
+
+   Example output:
+
+   ```
+   # kubectl get pods -A | grep upgrade
+   cattle-system      system-upgrade-controller-5685d568ff-tkvxb   1/1   Running   0   85m
+   harvester-system   hvst-upgrade-vq4hl-apply-manifests-65vv8     1/1   Running   0   87m   // waiting for managedchart to be ready
+   ..
+   ```
+
+2. The pod `hvst-upgrade-vq4hl-apply-manifests-65vv8` logs the following messages in a loop:
+
+   ```
+   Current version: 102.0.0+up40.1.2, Current state: WaitApplied, Current generation: 23
+   Sleep for 5 seconds to retry
+   ```
+
+3. Check the status of all bundles. Note that a couple of bundles are `OutOfSync`:
+
+   ```
+   # kubectl get bundle -A
+   NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
+   ...
+   fleet-local   mcc-local-managed-system-upgrade-controller   1/1
+   fleet-local   mcc-rancher-logging                           0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-logging-crd                       0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-monitoring                        0/1                       OutOfSync(1) [Cluster fleet-local/local]
+   fleet-local   mcc-rancher-monitoring-crd                    0/1                       WaitApplied(1) [Cluster fleet-local/local]
+   ```
+
+4. The pod `fleet-agent-*` has the following error log:
+
+   ```
+   fleet-agent pod log:
+
+   time="2023-09-19T12:18:10Z" level=error msg="Failed to register agent: looking up secret cattle-fleet-local-system/fleet-agent-bootstrap: Post \"https://192.168.122.199/apis/fleet.cattle.io/v1alpha1/namespaces/fleet-local/clusterregistrations\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.122.199 because it doesn't contain any IP SANs"
+   ```
+
+5. 
Check the `ssl-certificates` setting in Harvester:
+
+   From the command line:
+
+   ```
+   # kubectl get settings.harvesterhci.io ssl-certificates
+   NAME               VALUE
+   ssl-certificates   {"publicCertificate":"-----BEGIN CERTIFICATE-----\nMIIFNDCCAxygAwIBAgIUS7DoHthR/IR30+H/P0pv6HlfOZUwDQYJKoZIhvcNAQEL\nBQAwFjEUMBIGA1UEAwwLZXhhbXBsZS5j...."}
+   ```
+
+   From the Harvester Web UI:
+
+   ![](/img/v1.2/upgrade/known_issues/4519-harvester-settings-ssl-certificates.png)
+
+6. Check the `server-url` setting; its value is the VIP:
+
+   ```
+   # kubectl get settings.management.cattle.io -n cattle-system server-url
+   NAME         VALUE
+   server-url   https://192.168.122.199
+   ```
+
+7. The root cause:
+
+   The user set a self-signed certificate in `ssl-certificates` that covers only the FQDN, but `server-url` points to the VIP, so the `fleet-agent` pod fails to register.
+
+   ```
+   For example: create a self-signed certificate for *.example.com
+
+   openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
+     -keyout example.key -out example.crt -subj "/CN=example.com" \
+     -addext "subjectAltName=DNS:example.com,DNS:*.example.com"
+
+   The outputs are: example.crt and example.key
+   ```
+
+8. The workaround:
+
+   Update `server-url` with the value `https://harv31.example.com` (an FQDN covered by the certificate):
+
+   ```
+   # kubectl edit settings.management.cattle.io -n cattle-system server-url
+   setting.management.cattle.io/server-url edited
+   ...
+
+   # kubectl get settings.management.cattle.io -n cattle-system server-url
+   NAME         VALUE
+   server-url   https://harv31.example.com
+   ```
+
+   After the workaround is applied, Rancher automatically replaces the `fleet-agent` pod, which registers successfully, and the upgrade continues.
+
+- Related issue:
+  - [[BUG] Upgrade to Harvester 1.2.0 fails in fleet-agent due to customer provided SSL certificate without IP SAN](https://github.com/harvester/harvester/issues/4519)
+- Workaround:
+  - https://github.com/harvester/harvester/issues/4519#issuecomment-1727132383
+
+---
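Reviewer note: the SAN mismatch behind issue #4519 can be reproduced outside the cluster. The sketch below is illustrative only (the file names are hypothetical, and `192.168.122.199` stands in for the Harvester VIP): it creates one self-signed certificate with DNS SANs only, as in the failing configuration, and one that additionally carries the VIP as an IP SAN, then prints each certificate's SAN extension.

```shell
# Certificate with DNS SANs only -- matches the failing configuration:
# TLS validation for https://192.168.122.199 fails with "no IP SANs".
openssl req -x509 -newkey rsa:2048 -sha256 -days 3650 -nodes \
  -keyout fqdn.key -out fqdn.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"

# Certificate that also carries the VIP as an IP SAN -- this variant
# validates for both the FQDN and https://192.168.122.199.
openssl req -x509 -newkey rsa:2048 -sha256 -days 3650 -nodes \
  -keyout vip.key -out vip.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com,IP:192.168.122.199"

# Inspect the SAN extension of each certificate:
openssl x509 -in fqdn.crt -noout -ext subjectAltName
openssl x509 -in vip.crt  -noout -ext subjectAltName
```

Either side of the mismatch can be fixed: upload a certificate that includes the VIP as an IP SAN, or (as in the documented workaround) point `server-url` at an FQDN the certificate already covers.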