[BUG] upgrade stuck in waiting plan restart-rancher-system-agent to complete #5690

Closed
lanfon72 opened this issue Apr 26, 2024 · 12 comments
Labels
area/upgrade kind/bug Issues that are defects reported by users or that we know have reached a real release need-reprioritize not-require/test-plan Skip to create a e2e automation test issue reproduce/often Reproducible 10% to 99% of the time severity/1 Function broken (a critical incident with very high impact)

@lanfon72
Member

lanfon72 commented Apr 26, 2024

Describe the bug

The upgrade is stuck at "Upgrading System Service". Checking the log of the apply-manifests pod, it shows:

Waiting for plan hvst-upgrade-fm29b-skip-restart-rancher-system-agent to complete...
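
For reference, the stuck plan and the apply-manifests log can be inspected roughly as follows (a sketch only; the hvst-upgrade-fm29b prefix and the pod name are generated per upgrade, and the harvester-system namespace is assumed for the upgrade job):

# find the apply-manifests pod created by the upgrade job
kubectl -n harvester-system get pods | grep apply-manifests
# tail its log (shows the "Waiting for plan ... to complete" message above)
kubectl -n harvester-system logs <apply-manifests-pod-name> --tail=20
# list the system-upgrade-controller plans the job is waiting on
kubectl -n cattle-system get plans.upgrade.cattle.io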

To Reproduce

Steps to reproduce the behavior:

  1. Install Harvester with any number of nodes
  2. Create an image for VM creation
  3. Create Cluster network and VM network
  4. Setup backup-target
  5. Create new storageclass sc1 with 3 replicas and set it as default (see the sketch after this list)
  6. Create VM vm1 with VM network's VLAN
  7. Write data into vm1 then take backup vm1b
  8. Restore vm1b into vm2
  9. Create VM vm3 with additional volume
  10. Perform upgrade
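
Step 5, for example, can be done declaratively; a minimal sketch assuming the Longhorn CSI provisioner that Harvester ships (the name sc1 and the parameter values are illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc1
  annotations:
    # mark sc1 as the cluster default storage class
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io      # Longhorn CSI driver bundled with Harvester
parameters:
  numberOfReplicas: "3"              # 3 replicas as required by the test step
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF

Note that the pre-existing default class (harvester-longhorn, in a stock install) would also need its is-default-class annotation set to "false"; the Harvester UI handles this switch automatically.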

Expected behavior

The upgrade should succeed.

Environment:

  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal DL360 3 nodes
  • Harvester ISO version: v1.2.1 -> v1.2.2-rc1
  • ui-source Option: Auto

Additional context

The upgrade is performed by test automation (locally)

Support bundle

supportbundle_229d2fc0-5c52-42e2-a10e-9cd6a437adb1_2024-04-26T07-43-07Z.zip

@lanfon72 lanfon72 added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/1 Function broken (a critical incident with very high impact) need-reprioritize area/upgrade reproduce/rare Reproducible less than 10% of the time not-require/test-plan Skip to create a e2e automation test issue labels Apr 26, 2024
@bk201 bk201 added this to the v1.2.2 milestone Apr 26, 2024
@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Apr 29, 2024

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Upgrade to v1.2.2-rc2
  • The upgrade should not get stuck waiting for the plan restart-rancher-system-agent to complete
  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@albinsun

albinsun commented May 8, 2024

Hit this in a 3-node upgrade test from v1.2.1 -> v1.2.2-rc2.
Will test again after the RC3 release.
[screenshot attached]

@w13915984028
Member

In the support bundle, the plan object hvst-upgrade-fm29b-skip-restart-rancher-system-agent has no status field, while the other plan objects do:

- apiVersion: upgrade.cattle.io/v1
  kind: Plan
  metadata:
    creationTimestamp: "2024-04-25T20:55:07Z"
    generation: 1
    managedFields:
    - apiVersion: upgrade.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:concurrency: {}
          f:nodeSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:harvesterhci.io/managed: {}
          f:serviceAccountName: {}
          f:tolerations: {}
          f:upgrade:
            .: {}
            f:args: {}
            f:command: {}
            f:image: {}
          f:version: {}
      manager: kubectl-create
      operation: Update
      time: "2024-04-25T20:55:07Z"
    name: hvst-upgrade-fm29b-skip-restart-rancher-system-agent
    namespace: cattle-system
    resourceVersion: "363026"
    uid: 245884a7-fc5f-426f-a7ad-ce87c9f8ed79
  spec:
    concurrency: 10
    nodeSelector:
      matchLabels:
        harvesterhci.io/managed: "true"
    serviceAccountName: system-upgrade-controller
    tolerations:
    - operator: Exists
    upgrade:
      args:
      - sh
      - -c
      - set -x && mkdir -p /run/systemd/system/rancher-system-agent.service.d && echo
        -e '[Service]\nEnvironmentFile=-/run/systemd/system/rancher-system-agent.service.d/10-harvester-upgrade.env'
        | tee /run/systemd/system/rancher-system-agent.service.d/override.conf &&
        echo 'INSTALL_RKE2_SKIP_ENABLE=true' | tee /run/systemd/system/rancher-system-agent.service.d/10-harvester-upgrade.env
        && systemctl daemon-reload && systemctl restart rancher-system-agent.service
      command:
      - chroot
      - /host
      image: registry.suse.com/bci/bci-base:15.4
    version: 72c4fe8c
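
A quick way to spot plans with an empty status, as observed here, might be the following sketch (it assumes the Plan CRD's status.latestVersion field, which the system-upgrade-controller fills in once it processes a plan; custom-columns prints <none> when the field is missing):

kubectl -n cattle-system get plans.upgrade.cattle.io \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version,LATEST_VERSION:.status.latestVersion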

@w13915984028
Member

The system-upgrade-controller seems to have had an issue starting from 2024-04-25T20:51:55:

rancher-79d47f56-kjzkl/rancher.log

2024-04-25T20:51:55.627240258Z 2024/04/25 20:51:55 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
2024-04-25T20:51:55.758930047Z 2024/04/25 20:51:55 [INFO] [planner] rkecluster fleet-local/local: configuring bootstrap node(s) custom-5c1eb391c3d4: waiting for plan to be applied
2024-04-25T20:51:55.940208473Z 2024/04/25 20:51:55 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
2024-04-25T20:51:56.080572880Z 2024/04/25 20:51:56 [INFO] [planner] rkecluster fleet-local/local: configuring bootstrap node(s) custom-5c1eb391c3d4: waiting for plan to be applied
2024-04-25T20:51:56.083257126Z 2024/04/25 20:51:56 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
...
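
These entries can be pulled straight from the Rancher deployment's log, for example (a sketch):

kubectl -n cattle-system logs deploy/rancher --timestamps \
  | grep 'monitor-system-upgrade-controller-readiness'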

@w13915984028
Member

The provisioning.cattle.io/v1 cluster object fleet-local/local is missing the status field
fleetWorkspaceName: fleet-local (refer to the output in another thread: #5718 (comment))

apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
  kind: Cluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.25.9+rke2r1","rkeConfig":{"controlPlaneConfig":{"disable":["rke2-snapshot-controller","rke2-snapshot-controller-crd","rke2-snapshot-validation-webhook"]}}}}
      objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
      objectset.rio.cattle.io/id: provisioning-cluster-create
      objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
      objectset.rio.cattle.io/owner-name: local
      objectset.rio.cattle.io/owner-namespace: "null"
    creationTimestamp: "2024-04-25T14:02:00Z"
    finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    generation: 2
    labels:
      objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
      provider.cattle.io: harvester
    managedFields:
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:objectset.rio.cattle.io/applied: {}
            f:objectset.rio.cattle.io/id: {}
            f:objectset.rio.cattle.io/owner-gvk: {}
            f:objectset.rio.cattle.io/owner-name: {}
            f:objectset.rio.cattle.io/owner-namespace: {}
          f:finalizers:
            .: {}
            v:"wrangler.cattle.io/provisioning-cluster-remove"null: {}
            v:"wrangler.cattle.io/rke-cluster-remove"null: {}
          f:labels:
            .: {}
            f:objectset.rio.cattle.io/hash: {}
            f:provider.cattle.io: {}
        f:spec:
          .: {}
          f:localClusterAuthEndpoint: {}
      manager: rancher
      operation: Update
      time: "2024-04-25T14:02:00Z"
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          f:kubernetesVersion: {}
          f:rkeConfig: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: "2024-04-25T14:03:35Z"
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:clientSecretName: {}
          f:clusterName: {}
          f:conditions: {}
          f:observedGeneration: {}
          f:ready: {}
      manager: rancher
      operation: Update
      subresource: status
      time: "2024-04-25T20:55:00Z"
    name: local
    namespace: fleet-local
    resourceVersion: "362895"
    uid: 0d2d4ee9-3158-4651-9025-e7a009e84b0d
  spec:
    kubernetesVersion: v1.25.9+rke2r1
    localClusterAuthEndpoint: {}
    rkeConfig: {}
  status:
    clientSecretName: local-kubeconfig
    clusterName: local
    conditions:
    - lastUpdateTime: "2024-04-25T14:04:38Z"
      message: marking control plane as initialized and ready
      reason: Waiting
      status: Unknown
      type: Ready
    - lastUpdateTime: "2024-04-25T14:02:00Z"
      status: "False"
      type: Reconciling
    - lastUpdateTime: "2024-04-25T14:02:00Z"
      status: "False"
      type: Stalled
    - lastUpdateTime: "2024-04-25T14:03:35Z"
      status: "True"
      type: Created
    - lastUpdateTime: "2024-04-25T20:55:00Z"
      status: "True"
      type: RKECluster
    - status: Unknown
      type: DefaultProjectCreated
    - status: Unknown
      type: SystemProjectCreated
    - lastUpdateTime: "2024-04-25T14:02:13Z"
      status: "True"
      type: Connected
    - lastUpdateTime: "2024-04-25T14:29:17Z"
      status: "True"
      type: Updated
    - lastUpdateTime: "2024-04-25T14:29:17Z"
      status: "True"
      type: Provisioned
    observedGeneration: 2
    ready: true
kind: List
metadata:
  continue: "null"
  resourceVersion: "978153"
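
The missing field can also be checked directly on the live cluster object; empty output matches the "status.FleetWorkspaceName was blank" error above (a sketch):

kubectl -n fleet-local get clusters.provisioning.cattle.io local \
  -o jsonpath='{.status.fleetWorkspaceName}{"\n"}'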

@lanfon72
Member Author

lanfon72 commented May 8, 2024

@w13915984028 could you help check whether it has the same root cause as the original case?
I saw that @albinsun's cluster has an RKE2 guest cluster running on it, but the original case didn't.

@w13915984028
Member

w13915984028 commented May 8, 2024

I guess our latest PRs (restart fleet-agent) may have solved this issue. As @albinsun said in #5690 (comment), please validate it on the v1.2.2-RC3 release.

@irishgordo

I have additionally hit this with rc2; going to re-check with rc3.

@irishgordo irishgordo added reproduce/often Reproducible 10% to 99% of the time and removed reproduce/rare Reproducible less than 10% of the time labels May 10, 2024
@irishgordo

Bumped to reproduce/often as this seems to be a recurring theme with v1.2.2-rc2.

@irishgordo

I did not notice this on v1.2.2-rc3

@albinsun

Hit this in a 3-node upgrade test from v1.2.1 -> v1.2.2-rc2. Will test again after the RC3 release.

Providing my test run statistics:
I did not hit this either in about 5 trials of the 3-node upgrade from v1.2.1 to v1.2.2-rc3 (with Rancher and an RKE2 guest cluster).

@khushboo-rancher

Closing this based on the comments above.
