Describe the bug: A migration failure resulted in data corruption.
Expected behaviour: Even if the migration failed, the pool should have been renamed and the data should not have been corrupted.
Steps to reproduce the bug:
Created an SPC in 1.7.0, then upgraded it to 2.4.0.
mayadata:upgrade$ kubectl get spc,csp
NAME AGE
storagepoolclaim.openebs.io/cstor-pool 82m
NAME ALLOCATED FREE CAPACITY STATUS READONLY TYPE AGE
cstorpool.openebs.io/cstor-pool-3w1a 334K 39.7G 39.8G Healthy false striped 82m
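For reference, the SPC was a plain sparse striped pool claim; a minimal sketch of what was applied (field values are assumed for illustration, not the exact manifest used):
kubectl apply -f - <<'EOF'   # hedged sketch of the SPC, assuming a single sparse striped pool
apiVersion: openebs.io/v1alpha1
kind: StoragePoolClaim
metadata:
  name: cstor-pool
spec:
  name: cstor-pool
  type: sparse
  maxPools: 1
  poolSpec:
    poolType: striped
EOF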
Started the migration (a sketch of the migrate Job is shown after the output below) and made it fail just after the CSPI came online.
mayadata:upgrade$ k get cspc,cspi
NAME HEALTHYINSTANCES PROVISIONEDINSTANCES DESIREDINSTANCES AGE
cstorpoolcluster.cstor.openebs.io/cstor-pool 1 1 1 40m
NAME HOSTNAME FREE CAPACITY READONLY PROVISIONEDREPLICAS HEALTHYREPLICAS STATUS AGE
cstorpoolinstance.cstor.openebs.io/cstor-pool-972g 127.0.0.1 38500M 38500378k false 1 0 ONLINE 40m
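The SPC-to-CSPC migration was triggered with a migrate Job along the lines of the sketch below (image name, tag, service account and args are assumptions based on the 2.4.0 migration flow, not the exact Job used); the failure was then injected manually right after the CSPI turned ONLINE:
kubectl apply -f - <<'EOF'   # hedged sketch of the migration Job; adjust names to your cluster
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-cstor-pool
  namespace: openebs
spec:
  backoffLimit: 4
  template:
    spec:
      serviceAccountName: openebs-maya-operator
      containers:
      - name: migrate
        image: openebs/migrate:2.4.0
        args:
        - "cstor-spc"
        - "--spc-name=cstor-pool"
        - "--v=4"
      restartPolicy: OnFailure
EOF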
Checked the zpool status in the CSPI pool pod:
mayadata:upgrade$ kubectl -n openebs exec -it cstor-pool-972g-7f4cfdd794-z598d -c cstor-pool-mgmt -- bash
root@cstor-pool-972g-7f4cfdd794-z598d:/# zpool status
pool: cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774 ONLINE 0 0 0
/var/openebs/sparse/3-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/0-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/1-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/2-ndm-sparse.img ONLINE 0 0 0
errors: No known data errors
Then scaled the old CSP deployment back up, as sketched below, and checked the pool status (unexpected behaviour).
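The scale-up itself was just a kubectl scale (deployment name inferred from the CSP pod name below; yours will differ):
kubectl -n openebs scale deployment cstor-pool-3w1a --replicas=1   # bring the old CSP pool pod back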
mayadata:openebs$ kubectl -n openebs exec -it cstor-pool-3w1a-56695f78b7-x957h -c cstor-pool-mgmt -- bash
root@cstor-pool-3w1a-56695f78b7-x957h:/# zpool status
pool: cstor-76aad699-4e5f-4bd5-9a1b-16008d0d5c54
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
cstor-76aad699-4e5f-4bd5-9a1b-16008d0d5c54 ONLINE 0 0 0
/var/openebs/sparse/3-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/0-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/1-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/2-ndm-sparse.img ONLINE 0 0 0
errors: No known data errors
Was still able to write data from the application pod.
Restarted the CSPI pod (see the delete command sketched below); the pool was imported again, but zpool status now reports an error.
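The restart was a plain pod delete so the Deployment recreates it (pod name taken from the earlier exec; adjust to your cluster):
kubectl -n openebs delete pod cstor-pool-972g-7f4cfdd794-z598d   # Deployment spawns a fresh pool-manager pod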
mayadata:migrate$ k logs -f cstor-pool-972g-7f4cfdd794-2g8l2 -c cstor-pool-mgmt
+ rm /usr/local/bin/zrepl
+ pool_manager_pid=7
+ /usr/local/bin/pool-manager start
+ trap _sigint INT
+ trap _sigterm SIGTERM
+ wait 7
E0112 10:35:27.740140 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:27.740345 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:30.751974 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:30.752010 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:33.755995 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:33.756057 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:36.770793 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:36.770879 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:39.783352 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:39.783374 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:42.787035 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:42.787113 7 pool.go:123] Waiting for pool container to start...
E0112 10:35:45.797694 7 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 10:35:45.797771 7 pool.go:123] Waiting for pool container to start...
I0112 10:35:45.809247 7 controller.go:109] Setting up event handlers for CSPI
I0112 10:35:45.809704 7 controller.go:115] will set up informer event handlers for cvr
I0112 10:35:45.810120 7 new_restore_controller.go:105] Setting up event handlers for restore
I0112 10:35:45.886391 7 controller.go:110] Setting up event handlers for backup
I0112 10:35:45.893357 7 runner.go:38] Starting CStorPoolInstance controller
I0112 10:35:45.893409 7 runner.go:41] Waiting for informer caches to sync
I0112 10:35:45.909280 7 common.go:262] CStorPool found: [cannot open 'name': no such pool ]
I0112 10:35:45.909483 7 run_restore_controller.go:38] Starting CStorRestore controller
I0112 10:35:45.909525 7 run_restore_controller.go:41] Waiting for informer caches to sync
I0112 10:35:45.909556 7 run_restore_controller.go:53] Started CStorRestore workers
I0112 10:35:45.909674 7 runner.go:39] Starting CStorVolumeReplica controller
I0112 10:35:45.909706 7 runner.go:42] Waiting for informer caches to sync
I0112 10:35:45.909727 7 runner.go:47] Starting CStorVolumeReplica workers
I0112 10:35:45.909749 7 runner.go:54] Started CStorVolumeReplica workers
I0112 10:35:45.909893 7 runner.go:38] Starting CStorBackup controller
I0112 10:35:45.909926 7 runner.go:41] Waiting for informer caches to sync
I0112 10:35:45.993629 7 runner.go:45] Starting CStorPoolInstance workers
I0112 10:35:45.993667 7 runner.go:51] Started CStorPoolInstance workers
I0112 10:35:46.010362 7 runner.go:53] Started CStorBackup workers
I0112 10:35:46.017415 7 import.go:73] Importing pool 764d0038-cb8d-4b34-8ef8-5fb1efa80081 cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774
I0112 10:35:51.166697 7 event.go:281] Event(v1.ObjectReference{Kind:"CStorPoolInstance", Namespace:"openebs", Name:"cstor-pool-972g", UID:"764d0038-cb8d-4b34-8ef8-5fb1efa80081", APIVersion:"cstor.openebs.io/v1", ResourceVersion:"9230", FieldPath:""}): type: 'Normal' reason: 'Pool Imported' Pool Import successful: cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774
^C
mayadata:migrate$ k exec -it cstor-pool-972g-7f4cfdd794-2g8l2 -c cstor-pool-mgmt -- bash
root@cstor-pool-972g-7f4cfdd794-2g8l2:/# zpool status
pool: cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: none requested
config:
NAME STATE READ WRITE CKSUM
cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774 ONLINE 0 0 7
/var/openebs/sparse/3-ndm-sparse.img ONLINE 0 0 14
/var/openebs/sparse/0-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/1-ndm-sparse.img ONLINE 0 0 0
/var/openebs/sparse/2-ndm-sparse.img ONLINE 0 0 0
errors: 1 data errors, use '-v' for a list
Still able to write data from the application pod.
Restarted the CSP pool pod; the import failed (expected behaviour).
mayadata:upgrade$ k logs -f cstor-pool-3w1a-56695f78b7-nb2zp -c cstor-pool-mgmt
+ rm /usr/local/bin/zrepl
+ exec /usr/local/bin/cstor-pool-mgmt start
E0112 11:09:32.888080 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:32.888334 7 pool.go:502] Waiting for zpool replication container to start...
E0112 11:09:35.896036 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:35.896298 7 pool.go:502] Waiting for zpool replication container to start...
E0112 11:09:38.903751 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:38.903805 7 pool.go:502] Waiting for zpool replication container to start...
E0112 11:09:41.912888 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:41.912968 7 pool.go:502] Waiting for zpool replication container to start...
E0112 11:09:44.920051 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:44.920155 7 pool.go:502] Waiting for zpool replication container to start...
E0112 11:09:47.928038 7 pool.go:501] zpool status returned error in zrepl startup : exit status 1
I0112 11:09:47.928138 7 pool.go:502] Waiting for zpool replication container to start...
I0112 11:09:47.983445 7 common.go:218] CStorPool CRD found
I0112 11:09:47.987162 7 common.go:236] CStorVolumeReplica CRD found
I0112 11:09:47.987794 7 new_pool_controller.go:103] Setting up event handlers
I0112 11:09:47.988014 7 new_replica_controller.go:118] will set up informer event handlers for cvr
I0112 11:09:47.988181 7 new_backup_controller.go:104] Setting up event handlers for backup
I0112 11:09:47.990730 7 new_restore_controller.go:103] Setting up event handlers for restore
I0112 11:09:47.993062 7 run_pool_controller.go:43] Starting CStorPool controller
I0112 11:09:47.993095 7 run_pool_controller.go:46] Waiting for informer caches to sync
I0112 11:09:47.996167 7 new_pool_controller.go:125] cStorPool Added event : cstor-pool-3w1a, 76aad699-4e5f-4bd5-9a1b-16008d0d5c54
I0112 11:09:47.997357 7 event.go:281] Event(v1.ObjectReference{Kind:"CStorPool", Namespace:"", Name:"cstor-pool-3w1a", UID:"76aad699-4e5f-4bd5-9a1b-16008d0d5c54", APIVersion:"openebs.io/v1alpha1", ResourceVersion:"13474", FieldPath:""}): type: 'Normal' reason: 'Synced' Received Resource create event
W0112 11:09:47.997459 7 common.go:271] CStorPool not found. Retrying after 5s, err: <nil>
I0112 11:09:47.997871 7 handler.go:598] cVR 'pvc-cb2f311d-b114-4927-bf1b-ab30738a270d-cstor-pool-3w1a': uid '6109daf9-a239-4049-b255-1aaf9671a7e0': phase 'Healthy': is_empty_status: false
I0112 11:09:47.998211 7 event.go:281] Event(v1.ObjectReference{Kind:"CStorVolumeReplica", Namespace:"openebs", Name:"pvc-cb2f311d-b114-4927-bf1b-ab30738a270d-cstor-pool-3w1a", UID:"6109daf9-a239-4049-b255-1aaf9671a7e0", APIVersion:"openebs.io/v1alpha1", ResourceVersion:"13475", FieldPath:""}): type: 'Normal' reason: 'Synced' Received Resource create event
I0112 11:09:48.093300 7 run_pool_controller.go:50] Starting CStorPool workers
I0112 11:09:48.093360 7 run_pool_controller.go:56] Started CStorPool workers
I0112 11:09:48.236208 7 new_pool_controller.go:167] cStorPool Modify event : cstor-pool-3w1a, 76aad699-4e5f-4bd5-9a1b-16008d0d5c54
I0112 11:09:48.237655 7 event.go:281] Event(v1.ObjectReference{Kind:"CStorPool", Namespace:"", Name:"cstor-pool-3w1a", UID:"76aad699-4e5f-4bd5-9a1b-16008d0d5c54", APIVersion:"openebs.io/v1alpha1", ResourceVersion:"13490", FieldPath:""}): type: 'Normal' reason: 'Synced' Received Resource modify event
E0112 11:09:48.574618 7 run_pool_controller.go:117] error syncing 'cstor-pool-3w1a': expected csp object but got
cstorpool {null
}
W0112 11:09:53.005226 7 common.go:271] CStorPool not found. Retrying after 5s, err: <nil>
W0112 11:09:58.013215 7 common.go:271] CStorPool not found. Retrying after 5s, err: <nil>
W0112 11:10:03.021787 7 common.go:271] CStorPool not found. Retrying after 5s, err: <nil>
^C
mayadata:upgrade$ k exec -it cstor-pool-3w1a-56695f78b7-nb2zp -- bash
Defaulting container name to cstor-pool.
Use 'kubectl describe pod/cstor-pool-3w1a-56695f78b7-nb2zp -n openebs' to see all of the containers in this pod.
root@cstor-pool-3w1a-56695f78b7-nb2zp:/# zpool status
no pools available
root@cstor-pool-3w1a-56695f78b7-nb2zp:/# zpool import
2021-01-12/11:10:45.346 Iterating over all the devices to find zfs devices using blkid
2021-01-12/11:10:45.377 Iterated over cache devices to find zfs devices
no pools available to import
root@cstor-pool-3w1a-56695f78b7-nb2zp:/#
Restarted the CSPI pool pod once more and ended up with the issue the user reported.
mayadata:upgrade$ k logs -f cstor-pool-972g-7f4cfdd794-f2lsm -c cstor-pool-mgmt
+ rm /usr/local/bin/zrepl
+ pool_manager_pid=8
+ trap _sigint INT
+ /usr/local/bin/pool-manager start
+ trap _sigterm SIGTERM
+ wait 8
E0112 11:13:02.634184 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:02.634240 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:05.637713 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:05.637805 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:08.653611 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:08.653714 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:11.668001 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:11.668128 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:14.680239 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:14.680294 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:17.690164 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:17.690218 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:20.702640 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:20.702696 8 pool.go:123] Waiting for pool container to start...
E0112 11:13:23.717248 8 pool.go:122] zpool status returned error in zrepl startup : exit status 1
I0112 11:13:23.717277 8 pool.go:123] Waiting for pool container to start...
I0112 11:13:23.723416 8 controller.go:109] Setting up event handlers for CSPI
I0112 11:13:23.723781 8 controller.go:115] will set up informer event handlers for cvr
I0112 11:13:23.724125 8 new_restore_controller.go:105] Setting up event handlers for restore
I0112 11:13:23.733100 8 controller.go:110] Setting up event handlers for backup
I0112 11:13:23.737086 8 runner.go:38] Starting CStorPoolInstance controller
I0112 11:13:23.737111 8 runner.go:41] Waiting for informer caches to sync
I0112 11:13:23.743502 8 common.go:262] CStorPool found: [cannot open 'name': no such pool ]
I0112 11:13:23.743575 8 run_restore_controller.go:38] Starting CStorRestore controller
I0112 11:13:23.743584 8 run_restore_controller.go:41] Waiting for informer caches to sync
I0112 11:13:23.743595 8 run_restore_controller.go:53] Started CStorRestore workers
I0112 11:13:23.743643 8 runner.go:39] Starting CStorVolumeReplica controller
I0112 11:13:23.743655 8 runner.go:42] Waiting for informer caches to sync
I0112 11:13:23.743662 8 runner.go:47] Starting CStorVolumeReplica workers
I0112 11:13:23.743670 8 runner.go:54] Started CStorVolumeReplica workers
I0112 11:13:23.743719 8 runner.go:38] Starting CStorBackup controller
I0112 11:13:23.743732 8 runner.go:41] Waiting for informer caches to sync
I0112 11:13:23.743742 8 runner.go:53] Started CStorBackup workers
I0112 11:13:23.837328 8 runner.go:45] Starting CStorPoolInstance workers
I0112 11:13:23.837409 8 runner.go:51] Started CStorPoolInstance workers
I0112 11:13:23.891344 8 import.go:73] Importing pool 764d0038-cb8d-4b34-8ef8-5fb1efa80081 cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774
E0112 11:13:24.039603 8 import.go:94] Failed to import pool by reading cache file: cannot import 'cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Tue Jan 12 11:13:10 2021
should correct the problem. Approximately 5 seconds of data
must be discarded, irreversibly. Recovery can be attempted
by executing 'zpool import -F cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774'. A scrub of the pool
is strongly recommended after recovery.
: exit status 1
E0112 11:13:25.375807 8 import.go:114] Failed to import pool by scanning directory: 2021-01-12/11:13:24.042 Verifying pool existence on the device /var/openebs/sparse/0-ndm-sparse.img
2021-01-12/11:13:24.042 Verifying pool existence on the device /var/openebs/sparse/3-ndm-sparse.img
2021-01-12/11:13:24.042 Verifying pool existence on the device /var/openebs/sparse/4-ndm-sparse.img
2021-01-12/11:13:24.043 Verifying pool existence on the device /var/openebs/sparse/2-ndm-sparse.img
2021-01-12/11:13:24.043 Skipping /var/openebs/sparse/4-ndm-sparse.img device due to no labels on device
2021-01-12/11:13:24.043 Verifying pool existence on the device /var/openebs/sparse/shared-cstor-pool
2021-01-12/11:13:24.043 ERROR Skipping /var/openebs/sparse/shared-cstor-pool device due to failure in read stats or it is not a regular file/block device
2021-01-12/11:13:24.042 Verifying pool existence on the device /var/openebs/sparse/1-ndm-sparse.img
2021-01-12/11:13:25.069 Verified the device /var/openebs/sparse/1-ndm-sparse.img for pool existence
2021-01-12/11:13:25.081 Verified the device /var/openebs/sparse/3-ndm-sparse.img for pool existence
2021-01-12/11:13:25.092 Verified the device /var/openebs/sparse/2-ndm-sparse.img for pool existence
2021-01-12/11:13:25.107 Verified the device /var/openebs/sparse/0-ndm-sparse.img for pool existence
cannot import 'cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Tue Jan 12 11:13:10 2021
should correct the problem. Approximately 5 seconds of data
must be discarded, irreversibly. Recovery can be attempted
by executing 'zpool import -F cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774'. A scrub of the pool
is strongly recommended after recovery.
: exit status 1
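For completeness, the recovery path the error message itself suggests is a rewind import followed by a scrub; note that it irreversibly discards roughly the last few seconds of transactions:
zpool import -F cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774   # rewind to the last consistent txg, losing recent writes
zpool scrub cstor-7d9da0d6-904b-4310-8d90-3da1aacf4774       # verify remaining data after the rewind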
The user had hundreds of restarts on their pool pods, and their node went down a couple of times.
The suspected reason the lock did not work is that the lock-file path is not the same for the CSP and CSPI deployments.
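A minimal sketch of the guard that seems intended, assuming both the CSP and CSPI pool managers mount the same hostPath and agree on a single lock file (the path below reuses /var/openebs/sparse/shared-cstor-pool seen in the scan logs; this is an illustration of the idea, not the actual implementation):
# hedged sketch: serialize zpool import across pool-manager pods via one shared lock file
LOCKFILE=/var/openebs/sparse/shared-cstor-pool   # must resolve to the same host file in CSP and CSPI pods
exec 9>"$LOCKFILE"
if ! flock -n 9; then
  echo "another pool manager holds the import lock; refusing to import" >&2
  exit 1
fi
zpool import -d /var/openebs/sparse "$POOL_NAME"   # POOL_NAME is a placeholder; import only while holding the lock
If the two deployments resolve the lock to different paths, each manager happily takes its "own" lock, both can import the same sparse vdevs concurrently, and the checksum errors and I/O-error import failures seen above follow.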
To track this, we can follow up on this thread: https://kubernetes.slack.com/archives/CUAKPFU78/p1608665319368100
Environment details:
OpenEBS version (kubectl get po -n openebs --show-labels): 2.4.0
Kubernetes version (kubectl version): 1.18
OS (cat /etc/os-release): CentOS
Kernel (uname -a):