[BUG] upgrade stuck in waiting plan restart-rancher-system-agent to complete #5690

Closed
lanfon72 opened this issue Apr 26, 2024 · 12 comments
Labels
area/upgrade kind/bug Issues that are defects reported by users or that we know have reached a real release need-reprioritize not-require/test-plan Skip to create a e2e automation test issue reproduce/often Reproducible 10% to 99% of the time severity/1 Function broken (a critical incident with very high impact)

@lanfon72
Member

lanfon72 commented Apr 26, 2024

Describe the bug

The upgrade is stuck at "Upgrading System Service". Checking the log of the apply-manifests pod, it shows:

Waiting for plan hvst-upgrade-fm29b-skip-restart-rancher-system-agent to complete...
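
For reference, the stuck plan and the apply-manifests log can be inspected roughly as follows (a sketch only; the hvst-upgrade-fm29b prefix and the pod name are generated per upgrade, and the harvester-system namespace is assumed for the upgrade job):

# find the apply-manifests pod created by the upgrade job
kubectl -n harvester-system get pods | grep apply-manifests
# tail its log (shows the "Waiting for plan ... to complete" message above)
kubectl -n harvester-system logs <apply-manifests-pod-name> --tail=20
# list the system-upgrade-controller plans the job is waiting on
kubectl -n cattle-system get plans.upgrade.cattle.io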

To Reproduce

Steps to reproduce the behavior:

  1. Install Harvester with any number of nodes
  2. Create an image for VM creation
  3. Create Cluster network and VM network
  4. Setup backup-target
  5. Create new storageclass sc1 with 3 replicas and set it as default (see the sketch after this list)
  6. Create VM vm1 with VM network's VLAN
  7. Write data into vm1 then take backup vm1b
  8. Restore vm1b into vm2
  9. Create VM vm3 with additional volume
  10. Perform upgrade
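
Step 5, for example, can be done declaratively; a minimal sketch assuming the Longhorn CSI provisioner that Harvester ships (the name sc1 and the parameter values are illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc1
  annotations:
    # mark sc1 as the cluster default storage class
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io      # Longhorn CSI driver bundled with Harvester
parameters:
  numberOfReplicas: "3"              # 3 replicas as required by the test step
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF

Note that the pre-existing default class (harvester-longhorn, in a stock install) would also need its is-default-class annotation set to "false"; the Harvester UI handles this switch automatically.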

Expected behavior

The upgrade should succeed.

Environment:

  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Baremetal DL360 3 nodes
  • Harvester ISO version: v1.2.1 -> v1.2.2-rc1
  • ui-source Option: Auto

Additional context

The upgrade is performed by test automation (locally)

Support bundle

supportbundle_229d2fc0-5c52-42e2-a10e-9cd6a437adb1_2024-04-26T07-43-07Z.zip

@lanfon72 lanfon72 added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/1 Function broken (a critical incident with very high impact) need-reprioritize area/upgrade reproduce/rare Reproducible less than 10% of the time not-require/test-plan Skip to create a e2e automation test issue labels Apr 26, 2024
@bk201 bk201 added this to the v1.2.2 milestone Apr 26, 2024
@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Apr 29, 2024

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Upgrade to v1.2.2-rc2
  • The upgrade should not get stuck waiting for the plan restart-rancher-system-agent to complete
  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@albinsun

albinsun commented May 8, 2024

Hit this in a 3-node upgrade test from v1.2.1 -> v1.2.2-rc2.
Will test again after the RC3 release.
[screenshot attached]

@w13915984028
Member

In the support bundle, the plan object hvst-upgrade-fm29b-skip-restart-rancher-system-agent has no status field, while the other plan objects do:

- apiVersion: upgrade.cattle.io/v1
  kind: Plan
  metadata:
    creationTimestamp: "2024-04-25T20:55:07Z"
    generation: 1
    managedFields:
    - apiVersion: upgrade.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:concurrency: {}
          f:nodeSelector:
            .: {}
            f:matchLabels:
              .: {}
              f:harvesterhci.io/managed: {}
          f:serviceAccountName: {}
          f:tolerations: {}
          f:upgrade:
            .: {}
            f:args: {}
            f:command: {}
            f:image: {}
          f:version: {}
      manager: kubectl-create
      operation: Update
      time: "2024-04-25T20:55:07Z"
    name: hvst-upgrade-fm29b-skip-restart-rancher-system-agent
    namespace: cattle-system
    resourceVersion: "363026"
    uid: 245884a7-fc5f-426f-a7ad-ce87c9f8ed79
  spec:
    concurrency: 10
    nodeSelector:
      matchLabels:
        harvesterhci.io/managed: "true"
    serviceAccountName: system-upgrade-controller
    tolerations:
    - operator: Exists
    upgrade:
      args:
      - sh
      - -c
      - set -x && mkdir -p /run/systemd/system/rancher-system-agent.service.d && echo
        -e '[Service]\nEnvironmentFile=-/run/systemd/system/rancher-system-agent.service.d/10-harvester-upgrade.env'
        | tee /run/systemd/system/rancher-system-agent.service.d/override.conf &&
        echo 'INSTALL_RKE2_SKIP_ENABLE=true' | tee /run/systemd/system/rancher-system-agent.service.d/10-harvester-upgrade.env
        && systemctl daemon-reload && systemctl restart rancher-system-agent.service
      command:
      - chroot
      - /host
      image: registry.suse.com/bci/bci-base:15.4
    version: 72c4fe8c
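
A quick way to spot plans with an empty status, as observed here, might be the following sketch (it assumes the Plan CRD's status.latestVersion field, which the system-upgrade-controller fills in once it processes a plan; custom-columns prints <none> when the field is missing):

kubectl -n cattle-system get plans.upgrade.cattle.io \
  -o custom-columns=NAME:.metadata.name,VERSION:.spec.version,LATEST_VERSION:.status.latestVersion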

@w13915984028
Member

The system-upgrade-controller seems to have had an issue starting from 2024-04-25T20:51:55:

rancher-79d47f56-kjzkl/rancher.log

2024-04-25T20:51:55.627240258Z 2024/04/25 20:51:55 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
2024-04-25T20:51:55.758930047Z 2024/04/25 20:51:55 [INFO] [planner] rkecluster fleet-local/local: configuring bootstrap node(s) custom-5c1eb391c3d4: waiting for plan to be applied
2024-04-25T20:51:55.940208473Z 2024/04/25 20:51:55 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
2024-04-25T20:51:56.080572880Z 2024/04/25 20:51:56 [INFO] [planner] rkecluster fleet-local/local: configuring bootstrap node(s) custom-5c1eb391c3d4: waiting for plan to be applied
2024-04-25T20:51:56.083257126Z 2024/04/25 20:51:56 [ERROR] error syncing 'fleet-local/local': handler monitor-system-upgrade-controller-readiness: unable to sync system upgrade controller status for [&TypeMeta{Kind:Cluster,APIVersion:provisioning.cattle.io/v1,}] [fleet-local/local], status.FleetWorkspaceName was blank, requeuing
...
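
These entries can be pulled straight from the Rancher deployment's log, for example (a sketch):

kubectl -n cattle-system logs deploy/rancher --timestamps \
  | grep 'monitor-system-upgrade-controller-readiness'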

@w13915984028
Member

The provisioning.cattle.io/v1 cluster object fleet-local/local is missing the status field
fleetWorkspaceName: fleet-local (refer to the output in another thread: #5718 (comment))

apiVersion: v1
items:
- apiVersion: provisioning.cattle.io/v1
  kind: Cluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.25.9+rke2r1","rkeConfig":{"controlPlaneConfig":{"disable":["rke2-snapshot-controller","rke2-snapshot-controller-crd","rke2-snapshot-validation-webhook"]}}}}
      objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURrJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
      objectset.rio.cattle.io/id: provisioning-cluster-create
      objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
      objectset.rio.cattle.io/owner-name: local
      objectset.rio.cattle.io/owner-namespace: "null"
    creationTimestamp: "2024-04-25T14:02:00Z"
    finalizers:
    - wrangler.cattle.io/provisioning-cluster-remove
    - wrangler.cattle.io/rke-cluster-remove
    generation: 2
    labels:
      objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
      provider.cattle.io: harvester
    managedFields:
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:objectset.rio.cattle.io/applied: {}
            f:objectset.rio.cattle.io/id: {}
            f:objectset.rio.cattle.io/owner-gvk: {}
            f:objectset.rio.cattle.io/owner-name: {}
            f:objectset.rio.cattle.io/owner-namespace: {}
          f:finalizers:
            .: {}
            v:"wrangler.cattle.io/provisioning-cluster-remove"null: {}
            v:"wrangler.cattle.io/rke-cluster-remove"null: {}
          f:labels:
            .: {}
            f:objectset.rio.cattle.io/hash: {}
            f:provider.cattle.io: {}
        f:spec:
          .: {}
          f:localClusterAuthEndpoint: {}
      manager: rancher
      operation: Update
      time: "2024-04-25T14:02:00Z"
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec:
          f:kubernetesVersion: {}
          f:rkeConfig: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: "2024-04-25T14:03:35Z"
    - apiVersion: provisioning.cattle.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:clientSecretName: {}
          f:clusterName: {}
          f:conditions: {}
          f:observedGeneration: {}
          f:ready: {}
      manager: rancher
      operation: Update
      subresource: status
      time: "2024-04-25T20:55:00Z"
    name: local
    namespace: fleet-local
    resourceVersion: "362895"
    uid: 0d2d4ee9-3158-4651-9025-e7a009e84b0d
  spec:
    kubernetesVersion: v1.25.9+rke2r1
    localClusterAuthEndpoint: {}
    rkeConfig: {}
  status:
    clientSecretName: local-kubeconfig
    clusterName: local
    conditions:
    - lastUpdateTime: "2024-04-25T14:04:38Z"
      message: marking control plane as initialized and ready
      reason: Waiting
      status: Unknown
      type: Ready
    - lastUpdateTime: "2024-04-25T14:02:00Z"
      status: "False"
      type: Reconciling
    - lastUpdateTime: "2024-04-25T14:02:00Z"
      status: "False"
      type: Stalled
    - lastUpdateTime: "2024-04-25T14:03:35Z"
      status: "True"
      type: Created
    - lastUpdateTime: "2024-04-25T20:55:00Z"
      status: "True"
      type: RKECluster
    - status: Unknown
      type: DefaultProjectCreated
    - status: Unknown
      type: SystemProjectCreated
    - lastUpdateTime: "2024-04-25T14:02:13Z"
      status: "True"
      type: Connected
    - lastUpdateTime: "2024-04-25T14:29:17Z"
      status: "True"
      type: Updated
    - lastUpdateTime: "2024-04-25T14:29:17Z"
      status: "True"
      type: Provisioned
    observedGeneration: 2
    ready: true
kind: List
metadata:
  continue: "null"
  resourceVersion: "978153"
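
The missing field can also be checked directly on the live cluster object; empty output matches the "status.FleetWorkspaceName was blank" error above (a sketch):

kubectl -n fleet-local get clusters.provisioning.cattle.io local \
  -o jsonpath='{.status.fleetWorkspaceName}{"\n"}'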

@lanfon72
Member Author

lanfon72 commented May 8, 2024

@w13915984028 could you help check whether it has the same root cause as the original case?
I saw that @albinsun's cluster has an RKE2 guest cluster running on it, but the original case didn't.

@w13915984028
Member

w13915984028 commented May 8, 2024

I guess our latest PRs (restart fleet-agent) may have solved this issue. As @albinsun said in #5690 (comment), please validate it on the v1.2.2-RC3 release.

@irishgordo

I have additionally hit this with rc2; going to re-check with rc3.

@irishgordo irishgordo added reproduce/often Reproducible 10% to 99% of the time and removed reproduce/rare Reproducible less than 10% of the time labels May 10, 2024
@irishgordo

Bumped to reproduce/often as this seems to be a recurring theme with v1.2.2-rc2.

@irishgordo

I did not notice this on v1.2.2-rc3

@albinsun

Hit this in a 3-node upgrade test from v1.2.1 -> v1.2.2-rc2. Will test again after the RC3 release.

Providing my test run statistics:
I did not hit this either in about 5 trials of the 3-node upgrade from v1.2.1 to v1.2.2-rc3 (with Rancher and an RKE2 guest cluster).

@khushboo-rancher

Closing this based on the comments above.
