
[BUG] Harvester pod crashes after upgrading from v0.3.0 to v1.0.0-rc1 (contain vm backup before upgrade) #1644

Closed
Tracked by #1665
TachunLin opened this issue Dec 8, 2021 · 3 comments
Labels: area/multi-tenancy, area/storage, blocker, kind/bug, severity/1
Milestone: v1.0.0

@TachunLin

Describe the bug

Prepare a 4-node Harvester v0.3.0 cluster with a VM backup, then manually upgrade to v1.0.0-rc1.
After the upgrade, Harvester cannot be accessed and the Harvester pod crashes.

To Reproduce
Steps to reproduce the behavior:

  1. Download the Harvester v0.3.0 ISO and verify its checksum
  2. Download the Harvester v1.0.0 ISO and verify its checksum
  3. Install a 4-node Harvester cluster from the ISO
  4. Create several OS images from URLs
  5. Create an SSH key
  6. Enable the VLAN network on harvester-mgmt
  7. Create virtual network vlan1 with ID 1
  8. Create 2 virtual machines
  • ubuntu-vm: 2 cores, 4GB memory, 30GB disk
  • centos-vm: 2 cores, 4GB memory, 30GB disk


  9. Set up the backup target
  10. Take a backup of the ubuntu VM

Upgrade process
Follow the manual upgrade steps to upgrade from v0.3.0 to v1.0.0-rc1
https://github.com/harvester/docs/pull/67/files

Expected behavior

Harvester can be manually upgraded from v0.3.0 to v1.0.0-rc1 with an existing VM backup.
Harvester pods keep running without crashing.

Support bundle

bundle.zip

Environment:

  • Harvester ISO version before upgrade: v0.3.0
  • Harvester ISO version after upgrade: v1.0.0-rc1
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 4-node Harvester cluster on local KVM machines

Harvester network information

  • VIP: 192.168.122.194
  • node1: 192.168.122.36
  • node2: 192.168.122.85
  • node3: 192.168.122.186
  • node4: 192.168.122.97


Additional context

The Harvester pod panics with a nil pointer dereference in the VM backup controller:

E1207 11:10:55.797690       8 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 755 [running]:
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x22a7b80, 0x3e2af10)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x22a7b80, 0x3e2af10)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/harvester/harvester/pkg/controller/master/backup.IsBackupTargetSame(...)
	/go/src/github.com/harvester/harvester/pkg/controller/master/backup/util.go:31
github.com/harvester/harvester/pkg/controller/master/backup.(*Handler).uploadVMBackupMetadata(0xc00019ec00, 0xc008580000, 0xc00550d100, 0x0, 0x0)
	/go/src/github.com/harvester/harvester/pkg/controller/master/backup/backup.go:574 +0x6c
github.com/harvester/harvester/pkg/controller/master/backup.(*Handler).OnBackupChange(0xc00019ec00, 0xc0010f1050, 0x24, 0xc0029ea000, 0xc00046c180, 0x60, 0x58)
	/go/src/github.com/harvester/harvester/pkg/controller/master/backup/backup.go:130 +0x405
github.com/harvester/harvester/pkg/generated/controllers/harvesterhci.io/v1beta1.FromVirtualMachineBackupHandlerToHandler.func1(0xc0010f1050, 0x24, 0x29c3d40, 0xc0029ea000, 0x756ea12c64a09e, 0x73307081352ddbde, 0x403bcb, 0xc00578af20)
	/go/src/github.com/harvester/harvester/pkg/generated/controllers/harvesterhci.io/v1beta1/virtualmachinebackup.go:102 +0x6b
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.SharedControllerHandlerFunc.OnChange(0xc00578af00, 0xc0010f1050, 0x24, 0x29c3d40, 0xc0029ea000, 0xc00578af20, 0x756ea188bc575c, 0x40a63f, 0xc00003a000)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/sharedcontroller.go:29 +0x4e
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*SharedHandler).OnChange(0xc0007120c0, 0xc0010f1050, 0x24, 0x29c3d40, 0xc0029ea000, 0xc00292b801, 0x0)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/sharedhandler.go:69 +0x14c
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*controller).syncHandler(0xc0008ed8c0, 0xc0010f1050, 0x24, 0xc00292b948, 0x43b100)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/controller.go:215 +0xd1
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*controller).processSingleItem(0xc0008ed8c0, 0x21813a0, 0xc00578af20, 0x0, 0x0)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/controller.go:197 +0xe7
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*controller).processNextWorkItem(0xc0008ed8c0, 0x203000)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/controller.go:174 +0x54
github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*controller).runWorker(...)
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/controller.go:163
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00230e130)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00230e130, 0x299e500, 0xc0006b5fb0, 0x1, 0xc0001151a0)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00230e130, 0x3b9aca00, 0x0, 0xc001820201, 0xc0001151a0)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00230e130, 0x3b9aca00, 0xc0001151a0)
	/go/src/github.com/harvester/harvester/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller.(*controller).run
	/go/src/github.com/harvester/harvester/vendor/github.com/rancher/lasso/pkg/controller/controller.go:134 +0x33b
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1e7920c]
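
For context, here is a minimal Go sketch of the failure mode suggested by the trace. The types and function bodies below are hypothetical simplifications, not the actual Harvester code (the real IsBackupTargetSame lives in pkg/controller/master/backup/util.go); the point is that a VMBackup created on v0.3.0 has no backup target recorded in its status, so dereferencing that nil pointer during the comparison panics exactly as shown above.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real Harvester types
// (the actual ones live under pkg/apis and pkg/controller/master/backup).
type BackupTarget struct {
	Endpoint   string
	BucketName string
}

type VMBackupStatus struct {
	// VMBackup objects created on v0.3.0 predate this field, so after the
	// CRD change they are decoded with a nil BackupTarget.
	BackupTarget *BackupTarget
}

// isBackupTargetSame mirrors the failing comparison: it dereferences the
// pointer without a nil check and panics for pre-upgrade VMBackup objects.
func isBackupTargetSame(status *VMBackupStatus, target *BackupTarget) bool {
	return status.BackupTarget.Endpoint == target.Endpoint &&
		status.BackupTarget.BucketName == target.BucketName
}

func main() {
	current := &BackupTarget{Endpoint: "s3.example.com", BucketName: "harvester-backups"}
	old := &VMBackupStatus{} // VMBackup created on v0.3.0: BackupTarget is nil

	// Panics with "invalid memory address or nil pointer dereference",
	// matching the stack trace above.
	fmt.Println(isBackupTargetSame(old, current))
}
```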
@TachunLin added the kind/bug, area/storage, severity/1, and area/multi-tenancy labels on Dec 8, 2021
@guangbochen added the blocker label on Dec 8, 2021
@TachunLin added this to the v1.0.0 milestone on Dec 13, 2021
@guangbochen (Contributor)

Should be addressed by #1642.

@FrankYang0529 (Member)

I tried upgrading Harvester from v0.3.0 to master-head and the Harvester pods don't crash. However, the VMBackup CRD isn't updated, so we get error messages like these:

time="2021-12-14T10:00:19Z" level=error msg="error syncing 'ssl-certificates': handler harvester-setting-controller: helmchartconfigs.helm.cattle.io \"rke2-ingress-nginx\" not found, requeuing"
time="2021-12-14T10:02:18Z" level=error msg="error syncing 'default/s3-1': handler harvester-vm-backup-controller: no backup target in vmbackup.status, requeuing"

@TachunLin (Author)

Verified fixed after upgrading from v0.3.0 to master-935a1670-head (12/17, based on v1.0.0-rc1).
Closing this issue.

Result

After a manual upgrade from v0.3.0 to v1.0.0-rc1 (master-935a1670-head):

1. Harvester pods did not crash -> PASS

harvester-node01-upgrade100rc1:/home/rancher # kubectl get pods -n harvester-system
NAME                                                     READY   STATUS      RESTARTS   AGE
harvester-d544ddb6f-52mdk                                1/1     Running     0          58m
harvester-d544ddb6f-mg4dc                                1/1     Running     0          58m
harvester-d544ddb6f-npd7c                                1/1     Running     0          58m
harvester-load-balancer-59bf75f489-57nnp                 1/1     Running     0          58m
harvester-network-controller-68m6r                       1/1     Running     0          57m
harvester-network-controller-cdrjp                       1/1     Running     0          57m
harvester-network-controller-jfg96                       1/1     Running     0          58m
harvester-network-controller-manager-c57f8cbcb-67mcq     1/1     Running     0          58m
harvester-network-controller-manager-c57f8cbcb-9mqbx     1/1     Running     0          58m
harvester-network-controller-v2k2n                       1/1     Running     0          57m
harvester-node-disk-manager-29xrr                        1/1     Running     0          57m
harvester-node-disk-manager-v27hz                        1/1     Running     0          57m
harvester-node-disk-manager-wwpwf                        1/1     Running     0          58m
harvester-node-disk-manager-zwbxv                        1/1     Running     0          57m
harvester-promote-harvester-node02-upgrade100rc1-nphjz   0/1     Completed   0          9h
harvester-promote-harvester-node03-upgrade100rc1-72lj7   0/1     Completed   0          9h
harvester-webhook-67744f845f-pmrlg                       1/1     Running     0          57m
harvester-webhook-67744f845f-r5c44                       1/1     Running     0          58m
harvester-webhook-67744f845f-tqjkl                       1/1     Running     0          57m
kube-vip-2l2qp                                           1/1     Running     1          56m
kube-vip-cloud-provider-0                                1/1     Running     34         10h
kube-vip-cvklf                                           1/1     Running     0          56m
kube-vip-q99lt                                           1/1     Running     0          56m
virt-api-86455cdb7d-2hb4x                                1/1     Running     3          10h
virt-api-86455cdb7d-q8fpc                                1/1     Running     3          10h
virt-controller-5f649999dd-q5bqs                         1/1     Running     18         10h
virt-controller-5f649999dd-sqh9g                         1/1     Running     20         10h
virt-handler-4ncxn                                       1/1     Running     3          10h
virt-handler-cmzg2                                       1/1     Running     3          9h
virt-handler-k8pg9                                       1/1     Running     3          10h
virt-handler-x754t                                       1/1     Running     2          8h
virt-operator-56c5bdc7b8-cgwc8                           1/1     Running     28         10h

2. Check whether longhornBackupName is in each VM Backup. -> PASS

$ kubectl get backup -A
longhorn-system   backup-7933c0d09ec04d1a   snapshot-b1fbfcf4-ad45-442d-9bb6-8119e713d892   1367343104     2021-12-17T07:13:00Z   Completed   2021-12-17T07:17:51.097572079Z

$ kubectl get vmbackup ubuntu-backup -o yaml | less

volumeBackups:
  - creationTime: "2021-12-17T07:12:59Z"
    longhornBackupName: backup-7933c0d09ec04d1a
    name: ubuntu-backup-volume-ubuntu-vm-disk-0-ylodf

3. Check whether there is a .cfg file in the backup target. -> PASS
There is a default-ubuntu-backup.cfg file in the remote S3 backup bucket.

$ cat default-ubuntu-backup.cfg

(screenshot: contents of default-ubuntu-backup.cfg)

4. Check whether the VM can be restored. -> PASS
A new VM can be created by restoring the existing backup.

![image](https://user-images.githubusercontent.com/29251855/146564664-46809072-a320-44a6-8665-29aa1b8d936f.png)


Environment:

  • Harvester ISO version before upgrade: v0.3.0
  • Harvester ISO version after upgrade: v1.0.0-rc1
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 4-node Harvester cluster on local KVM machines

Harvester node information

  • VIP: 192.168.122.71
  • node1: 192.168.122.253 (6 core, 12GB, 200GB)
  • node2: 192.168.122.10 (6 core, 12GB, 200GB)
  • node3: 192.168.122.36 (4 core, 12GB, 200GB)
  • node4: 192.168.122.97 (4 core, 10GB, 200GB)

Verify Steps

  1. Download the Harvester v0.3.0 ISO and verify its checksum
  2. Download the Harvester v1.0.0 ISO and verify its checksum
  3. Install a 4-node Harvester cluster from the ISO
  4. Create several OS images from URLs
  5. Create an SSH key
  6. Enable the VLAN network on harvester-mgmt
  7. Create virtual network vlan1 with ID 1
  8. Create 2 virtual machines
  • ubuntu-vm: 2 cores, 4GB memory, 30GB disk
  9. Set up the backup target
  10. Take a backup of the ubuntu VM

Upgrade process
Follow the manual upgrade steps to upgrade from v0.3.0 to v1.0.0-rc1
https://github.com/harvester/docs/pull/67/files

Add the following content to /usr/local/harvester-upgrade/upgrade-helpers/manifests/10-harvester.yaml
before upgrading the Harvester controller node:

---
apiVersion: management.cattle.io/v3
kind: ManagedChart
metadata:
  name: harvester-crd
  namespace: fleet-local
spec:
  chart: harvester-crd
  releaseName: harvester-crd
  version: 0.0.0-dev
  defaultNamespace: harvester-system
  repoName: harvester-charts
  # takeOwnership will force apply this chart without checking ownership in labels and annotations.
  # https://github.com/rancher/fleet/blob/ce9c0d6c0a455d61e87c0f19df79d0ee11a89eeb/pkg/helmdeployer/deployer.go#L323
  # https://github.com/rancher/helm/blob/ee91a121e0aa301fcef2bfbc7184f96edd4b50f5/pkg/action/validate.go#L71-L76
  takeOwnership: true
  targets:
  - clusterName: local
    clusterSelector:
      matchExpressions:
      - key: provisioning.cattle.io/unmanaged-system-agent
        operator: DoesNotExist
  values: {}

Additional Context

Due to issue #1645, we currently can't access the Harvester dashboard via the VIP.
And due to #1666, after the upgrade we are still not able to log in with the original admin password.
