[BUG] Upgrade: post-drain job fails when upgrading to the same version #3175

Closed
bk201 opened this issue Nov 18, 2022 · 6 comments
Labels: area/upgrade-related, kind/bug, priority/0, reproduce/always, severity/1
Milestone: v1.1.1


bk201 commented Nov 18, 2022

Describe the bug

The post-drain job fails if upgrading the same ISO twice.

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster with a master ISO newer than v1.1.1-rc1.
  2. Upgrade to the same ISO again.
  3. The post-drain job keeps retrying with:
hvst-upgrade-ss75g-post-drain-node1-fzl69 +++ VARIANT_ID=Harvester-20221109
hvst-upgrade-ss75g-post-drain-node1-fzl69 +++ GRUB_ENTRY_NAME='Harvester d0cb3f1-dirty'
hvst-upgrade-ss75g-post-drain-node1-fzl69 ++ echo Harvester d0cb3f1-dirty
hvst-upgrade-ss75g-post-drain-node1-fzl69 + CURRENT_OS_VERSION='Harvester d0cb3f1-dirty'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' 'Harvester d0cb3f1-dirty' = 'Harvester d0cb3f1-dirty' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + echo 'Skip upgrading OS. The OS version is already "Harvester d0cb3f1-dirty".'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + return
hvst-upgrade-ss75g-post-drain-node1-fzl69 + clean_up_tmp_files
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' -n '' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' -n '' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + echo 'Clean up tmp files...'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + [[ -n '' ]]
hvst-upgrade-ss75g-post-drain-node1-fzl69 + [[ -n '' ]]
hvst-upgrade-ss75g-post-drain-node1-fzl69 Clean up tmp files...
hvst-upgrade-ss75g-post-drain-node1-fzl69 [Fri Nov 18 05:28:47 UTC 2022] Running "upgrade_node.sh post-drain" errors, will retry after 10 minutes (0 retries)...

Looks like the trap threw the wrong exit code.
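
For context, this looks like the classic errexit-in-trap gotcha. A minimal sketch of the failure mode (illustrative only, not the actual upgrade_node.sh; TMP_FILE is a made-up stand-in):

#!/usr/bin/env bash
set -e

clean_up_tmp_files() {
  echo "Clean up tmp files..."
  # With TMP_FILE empty, this test list returns 1; as the last command,
  # it makes the whole handler return 1.
  [ -n "$TMP_FILE" ] && rm -f "$TMP_FILE"
}

trap clean_up_tmp_files EXIT
echo "main body succeeded"
# On exit the trap fires, the handler fails, and under `set -e` the script
# exits non-zero even though the main body succeeded -- which the upgrade
# runner reads as "errors, will retry".

Per the PR title ("make sure trap will throw no errors"), the remedy is presumably to guarantee the handler's last command succeeds, e.g. by appending || true.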

Expected behavior

The job should succeed.

Support bundle

Environment

  • Harvester ISO version:
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context
Add any other context about the problem here.

bk201 added the kind/bug, priority/0, severity/1, area/upgrade-related, and reproduce/always labels on Nov 18, 2022
bk201 added this to the v1.1.1 milestone on Nov 18, 2022

harvesterhci-io-github-bot commented Nov 18, 2022

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Upgrade: post-drain job fails when upgrading to the same version #3175

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at: fix(upgrade): make sure trap will throw no errors #3176

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#618

@TachunLin

Upgrading the same version, from v1.1.1-rc2 to v1.1.1-rc2, prompts the following message and prevents the upgrade from proceeding:

The current version v1.1.1-rc2 is less than the minimum upgradable version v1.0.3.


Will try following the workaround to proceed with the upgrade.
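
For reference, the minimum-upgradable-version check comes from the Version resource used in the offline upgrade flow. A sketch of such a version.yaml (field names per the harvesterhci.io Version CRD; the ISO URL is only a placeholder):

apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.1.1-rc2
  namespace: harvester-system
spec:
  # placeholder URL; point this at the actual v1.1.1-rc2 ISO
  isoURL: https://example.com/harvester-v1.1.1-rc2-amd64.iso
  minUpgradableVersion: "v1.0.3"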

@TachunLin

Built another 3-node Harvester cluster to perform the same-version v1.1.1-rc2 upgrade test.

After applying the workaround to pause the managed chart and pretend the v1.0.3 release:

  • The upgrade gets stuck while the first node is preloading images, with:
    Job has reached the specified backoff limit

We will continue to look into the details.


starbops commented Nov 23, 2022

When testing against the fix, we need to apply the workaround prior to the upgrade. This basically makes Harvester think of itself as an older, official release version like v1.0.3 or v1.1.0, while the controller codebase is actually v1.1.1-rc2. However, this leads to an upgrade failure in the image preloading stage:

  1. The controller creates the image preload Plan with the v1.0.3 or v1.1.0 container image, but with the do_upgrade_node.sh command (introduced in v1.1.1-rc1), which does not exist in that container image (see the sketch below)
  2. The corresponding Pods fail to spawn due to the "command not found" error
  3. The upgrade ends up with "Job has reached the specified backoff limit"
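
In other words, the generated Plan mixes the two versions. A rough sketch of the mismatch (shape per the system-upgrade-controller's upgrade.cattle.io/v1 Plan; the name here is hypothetical, not copied from the cluster):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: hvst-upgrade-xxxxx-prepare   # hypothetical name
  namespace: cattle-system
spec:
  upgrade:
    image: rancher/harvester-upgrade:v1.0.3   # image of the pretended (older) release...
    command: ["do_upgrade_node.sh"]           # ...but the script only ships in >= v1.1.1-rc1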

(Two screenshots of the failing Pods, taken 2022-11-22.)

To better test this, we may have to fake the harvester-upgrade:v1.0.3 container image with a v1.1.1-rc2 one by retagging harvester-upgrade:v1.1.1-rc2 as harvester-upgrade:v1.0.3 on each Harvester node:

ctr -n k8s.io image tag docker.io/rancher/harvester-upgrade:v1.1.1-rc2 docker.io/rancher/harvester-upgrade:v1.0.3 --force
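
To double-check the retag on a node, list the images and confirm both tags are present (standard ctr image listing; the grep is just a filter):

ctr -n k8s.io images ls | grep harvester-upgrade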

Then kick off the upgrade.


TachunLin commented Nov 23, 2022

Verified fixed on v1.1.1-rc2. Closing this issue.

Result

We built a v1.1.1-rc2 Harvester cluster and upgraded the existing cluster to the same version.

After applying the workaround to pause the managed chart, pretend the v1.1.0 release, and retag harvester-upgrade to v1.1.0:

  • The post-drain job on each node succeeds and the upgrade finishes.

  • The post-drain job log runs to completion:

    + upgrade_os
    + trap clean_up_tmp_files EXIT
    ++ source /host/etc/os-release
    +++ NAME='SLE Micro'
    +++ VERSION=5.2
    +++ VERSION_ID=5.2
    +++ PRETTY_NAME='Harvester v1.1.1-rc2'
    +++ ID=sle-micro-rancher
    +++ ID_LIKE=suse
    +++ ANSI_COLOR='0;32'
    +++ CPE_NAME=cpe:/o:suse:sle-micro-rancher:5.2
    +++ VARIANT=Harvester
    +++ VARIANT_ID=Harvester-20221109
    +++ GRUB_ENTRY_NAME='Harvester v1.1.1-rc2'
    ++ echo Harvester v1.1.1-rc2
    + CURRENT_OS_VERSION='Harvester v1.1.1-rc2'
    + '[' 'Harvester v1.1.1-rc2' = 'Harvester v1.1.1-rc2' ']'
    + echo 'Skip upgrading OS. The OS version is already "Harvester v1.1.1-rc2".'
    + return
    Skip upgrading OS. The OS version is already "Harvester v1.1.1-rc2".
    + clean_up_tmp_files
    + '[' -n '' ']'
    + '[' -n '' ']'
    Clean up tmp files...
    + echo 'Clean up tmp files...'
    + '[' -n '' ']'
    + '[' -n '' ']'
    

Test Information

  • Test Environment: 3-node Harvester cluster on remote KVM machines
  • Harvester version: v1.1.1-rc2

Verify Steps

  1. Create a three-node v1.1.1-rc2 Harvester cluster
  2. Prepare an offline upgrade version.yaml for v1.1.1-rc2
  3. Apply the version.yaml
  4. Follow the workaround to pause the managed chart and pretend the v1.1.0 release
  5. Pause the managed chart:
$ kubectl edit managedchart harvester -n fleet-local

# edit spec.paused: true
spec:
  paused: true
  6. Edit the harvester deployment to pretend the cluster is v1.1.0:
$ kubectl edit deploy harvester -n harvester-system

      # add the env variable HARVESTER_SERVER_VERSION
      containers:
      - env:
        - name: HARVESTER_SERVER_VERSION
          value: "v1.1.0"
        - name: HARVESTER_SERVER_HTTPS_PORT
          value: "8443
  7. Retag harvester-upgrade:v1.1.1-rc2 as harvester-upgrade:v1.1.0 on each Harvester node:
ctr -n k8s.io image tag docker.io/rancher/harvester-upgrade:v1.1.1-rc2 docker.io/rancher/harvester-upgrade:v1.1.0 --force
  8. Apply the workaround to increase the job deadline:
$ cat > /tmp/fix.yaml <<EOF
spec:
  values:
    systemUpgradeJobActiveDeadlineSeconds: "3600"
EOF

$ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
$ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
  9. Trigger the upgrade to the same version, v1.1.1-rc2
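
Once triggered, the per-node jobs can be watched to confirm the post-drain phase completes instead of retrying (the harvester-system namespace is assumed from the job names in the logs above):

kubectl get jobs -n harvester-system -w | grep post-drain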
