[BUG] Upgrade: post-drain job fails when upgrading to the same version #3175

Closed
bk201 opened this issue Nov 18, 2022 · 6 comments
Labels: area/upgrade-related, kind/bug, priority/0, reproduce/always, severity/1
Milestone: v1.1.1


bk201 commented Nov 18, 2022

Describe the bug

The post-drain job fails if upgrading the same ISO twice.

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster with a master ISO newer than v1.1.1-rc1.
  2. Upgrade to the same ISO again.
  3. The post-drain job keeps retrying with:
hvst-upgrade-ss75g-post-drain-node1-fzl69 +++ VARIANT_ID=Harvester-20221109
hvst-upgrade-ss75g-post-drain-node1-fzl69 +++ GRUB_ENTRY_NAME='Harvester d0cb3f1-dirty'
hvst-upgrade-ss75g-post-drain-node1-fzl69 ++ echo Harvester d0cb3f1-dirty
hvst-upgrade-ss75g-post-drain-node1-fzl69 + CURRENT_OS_VERSION='Harvester d0cb3f1-dirty'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' 'Harvester d0cb3f1-dirty' = 'Harvester d0cb3f1-dirty' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + echo 'Skip upgrading OS. The OS version is already "Harvester d0cb3f1-dirty".'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + return
hvst-upgrade-ss75g-post-drain-node1-fzl69 + clean_up_tmp_files
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' -n '' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + '[' -n '' ']'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + echo 'Clean up tmp files...'
hvst-upgrade-ss75g-post-drain-node1-fzl69 + [[ -n '' ]]
hvst-upgrade-ss75g-post-drain-node1-fzl69 + [[ -n '' ]]
hvst-upgrade-ss75g-post-drain-node1-fzl69 Clean up tmp files...
hvst-upgrade-ss75g-post-drain-node1-fzl69 [Fri Nov 18 05:28:47 UTC 2022] Running "upgrade_node.sh post-drain" errors, will retry after 10 minutes (0 retries)...

Looks like the trap threw the wrong exit code.
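
For context, this looks like the classic errexit-in-trap gotcha. A minimal sketch of the failure mode (illustrative only, not the actual upgrade_node.sh; TMP_FILE is a made-up stand-in):

#!/usr/bin/env bash
set -e

clean_up_tmp_files() {
  echo "Clean up tmp files..."
  # With TMP_FILE empty, this test list returns 1; as the last command,
  # it makes the whole handler return 1.
  [ -n "$TMP_FILE" ] && rm -f "$TMP_FILE"
}

trap clean_up_tmp_files EXIT
echo "main body succeeded"
# On exit the trap fires, the handler fails, and under `set -e` the script
# exits non-zero even though the main body succeeded -- which the upgrade
# runner reads as "errors, will retry".

Per the PR title ("make sure trap will throw no errors"), the remedy is presumably to guarantee the handler's last command succeeds, e.g. by appending || true.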

Expected behavior

The job should succeed.

Support bundle

Environment

  • Harvester ISO version:
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context
Add any other context about the problem here.

bk201 added the kind/bug, priority/0, severity/1, area/upgrade-related, and reproduce/always labels on Nov 18, 2022
bk201 added this to the v1.1.1 milestone on Nov 18, 2022

harvesterhci-io-github-bot commented Nov 18, 2022

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Upgrade: post-drain job fails when upgrading to the same version #3175

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at: fix(upgrade): make sure trap will throw no errors #3176

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#618

@TachunLin

Upgrading the same version, from v1.1.1-rc2 to v1.1.1-rc2, prompts the following message and prevents the upgrade from proceeding:

The current version v1.1.1-rc2 is less than the minimum upgradable version v1.0.3.


Will try following the workaround to proceed with the upgrade.
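
For reference, the minimum-upgradable-version check comes from the Version resource used in the offline upgrade flow. A sketch of such a version.yaml (field names per the harvesterhci.io Version CRD; the ISO URL is only a placeholder):

apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.1.1-rc2
  namespace: harvester-system
spec:
  # placeholder URL; point this at the actual v1.1.1-rc2 ISO
  isoURL: https://example.com/harvester-v1.1.1-rc2-amd64.iso
  minUpgradableVersion: "v1.0.3"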

@TachunLin

Built another 3-node Harvester cluster to perform the same-version v1.1.1-rc2 upgrade test.

After applying the workaround to pause the managed chart and pretend the v1.0.3 release:

  • The upgrade gets stuck while the first node is preloading images, with:
    Job has reached the specified backoff limit

We will continue to look into the details.


starbops commented Nov 23, 2022

When testing against the fix, we need to apply the workaround prior to the upgrade. This basically makes Harvester think of itself as an older, official release version like v1.0.3 or v1.1.0, while the controller codebase is actually v1.1.1-rc2. However, this leads to an upgrade failure in the image preloading stage:

  1. The controller creates the image preload Plan with the v1.0.3 or v1.1.0 container image, but with the do_upgrade_node.sh command (introduced in v1.1.1-rc1), which does not exist in that container image (see the sketch below)
  2. The corresponding Pods fail to spawn due to the "command not found" error
  3. The upgrade ends up with "Job has reached the specified backoff limit"
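
In other words, the generated Plan mixes the two versions. A rough sketch of the mismatch (shape per the system-upgrade-controller's upgrade.cattle.io/v1 Plan; the name here is hypothetical, not copied from the cluster):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: hvst-upgrade-xxxxx-prepare   # hypothetical name
  namespace: cattle-system
spec:
  upgrade:
    image: rancher/harvester-upgrade:v1.0.3   # image of the pretended (older) release...
    command: ["do_upgrade_node.sh"]           # ...but the script only ships in >= v1.1.1-rc1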

(Two screenshots of the failing Pods, taken 2022-11-22.)

To better test this, we may have to fake the harvester-upgrade:v1.0.3 container image with a v1.1.1-rc2 one by retagging harvester-upgrade:v1.1.1-rc2 as harvester-upgrade:v1.0.3 on each Harvester node:

ctr -n k8s.io image tag docker.io/rancher/harvester-upgrade:v1.1.1-rc2 docker.io/rancher/harvester-upgrade:v1.0.3 --force
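
To double-check the retag on a node, list the images and confirm both tags are present (standard ctr image listing; the grep is just a filter):

ctr -n k8s.io images ls | grep harvester-upgrade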

Then kick off the upgrade.


TachunLin commented Nov 23, 2022

Verified fixed on v1.1.1-rc2. Closing this issue.

Result

We built a v1.1.1-rc2 Harvester cluster and upgraded the existing cluster to the same version.

After applying the workaround to pause the managed chart, pretend the v1.1.0 release, and retag harvester-upgrade to v1.1.0:

  • The post-drain job on each node succeeds and the upgrade finishes.

  • The post-drain job log runs to completion:

    + upgrade_os
    + trap clean_up_tmp_files EXIT
    ++ source /host/etc/os-release
    +++ NAME='SLE Micro'
    +++ VERSION=5.2
    +++ VERSION_ID=5.2
    +++ PRETTY_NAME='Harvester v1.1.1-rc2'
    +++ ID=sle-micro-rancher
    +++ ID_LIKE=suse
    +++ ANSI_COLOR='0;32'
    +++ CPE_NAME=cpe:/o:suse:sle-micro-rancher:5.2
    +++ VARIANT=Harvester
    +++ VARIANT_ID=Harvester-20221109
    +++ GRUB_ENTRY_NAME='Harvester v1.1.1-rc2'
    ++ echo Harvester v1.1.1-rc2
    + CURRENT_OS_VERSION='Harvester v1.1.1-rc2'
    + '[' 'Harvester v1.1.1-rc2' = 'Harvester v1.1.1-rc2' ']'
    + echo 'Skip upgrading OS. The OS version is already "Harvester v1.1.1-rc2".'
    + return
    Skip upgrading OS. The OS version is already "Harvester v1.1.1-rc2".
    + clean_up_tmp_files
    + '[' -n '' ']'
    + '[' -n '' ']'
    Clean up tmp files...
    + echo 'Clean up tmp files...'
    + '[' -n '' ']'
    + '[' -n '' ']'
    

Test Information

  • Test Environment: 3-node Harvester cluster on remote KVM machines
  • Harvester version: v1.1.1-rc2

Verify Steps

  1. Create a three-node v1.1.1-rc2 Harvester cluster
  2. Prepare an offline upgrade version.yaml for v1.1.1-rc2
  3. Apply the version.yaml
  4. Follow the workaround to pause the managed chart and pretend the v1.1.0 release
  5. Pause the managed chart:
$ kubectl edit managedchart harvester -n fleet-local

# edit spec.paused: true
spec:
  paused: true
  6. Edit the harvester deployment to pretend the cluster is v1.1.0:
$ kubectl edit deploy harvester -n harvester-system

      # add the env variable HARVESTER_SERVER_VERSION
      containers:
      - env:
        - name: HARVESTER_SERVER_VERSION
          value: "v1.1.0"
        - name: HARVESTER_SERVER_HTTPS_PORT
          value: "8443
  7. Retag harvester-upgrade:v1.1.1-rc2 as harvester-upgrade:v1.1.0 on each Harvester node:
ctr -n k8s.io image tag docker.io/rancher/harvester-upgrade:v1.1.1-rc2 docker.io/rancher/harvester-upgrade:v1.1.0 --force
  8. Apply the workaround to increase the job deadline:
$ cat > /tmp/fix.yaml <<EOF
spec:
  values:
    systemUpgradeJobActiveDeadlineSeconds: "3600"
EOF

$ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
$ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
  9. Trigger the upgrade to the same version, v1.1.1-rc2
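
Once triggered, the per-node jobs can be watched to confirm the post-drain phase completes instead of retrying (the harvester-system namespace is assumed from the job names in the logs above):

kubectl get jobs -n harvester-system -w | grep post-drain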
