Failed to delete the blocks because the data couldn't be flushed in time #204

Closed

lxbsz opened this issue Apr 15, 2019 · 3 comments · May be fixed by open-iscsi/tcmu-runner#546
Comments

@lxbsz
Collaborator

lxbsz commented Apr 15, 2019

Kind of issue

Bug

Observed behavior

When deleting the block volume, the command fails with logs like:

# gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: FAIL

Expected/desired behavior

The delete should succeed.

Details on how to reproduce (minimal and precise)

Create and then delete blocks repeatedly in a bash script.

Logs and Information about the environment:

10467 [2019-04-15 04:56:13.909267] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2971 :<glusterBlockCleanUp>]
10468 [2019-04-15 04:56:13.909282] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4687 :<block_delete_cli_1_svc_st>]

By adding more logs manually, we can see that, judging from the metafile, the delete appears not to have succeeded:

10454 [2019-04-15 04:56:13.906077] INFO: command exit code, 0 [at block_svc_routines.c+4761 :<block_delete_1_svc_st>]
10455 [2019-04-15 04:56:13.908336] INFO: VOLUME: hosting-volume [at glfs-operations.c+708 :<blockGetMetaInfo>]
10456 [2019-04-15 04:56:13.908423] INFO: GBID: e1c70c46-8c60-433d-8cea-05b1e8e77694 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10457 [2019-04-15 04:56:13.908490] INFO: HA: 1 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10458 [2019-04-15 04:56:13.908556] INFO: ENTRYCREATE: INPROGRESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10459 [2019-04-15 04:56:13.908618] INFO: PRIOPATH: 192.168.195.164 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10460 [2019-04-15 04:56:13.908737] INFO: SIZE: 1048576 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10461 [2019-04-15 04:56:13.908821] INFO: RINGBUFFER: 32 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10462 [2019-04-15 04:56:13.908889] INFO: ENTRYCREATE: SUCCESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10463 [2019-04-15 04:56:13.908956] INFO: 192.168.195.164: CONFIGINPROGRESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10464 [2019-04-15 04:56:13.909021] INFO: 192.168.195.164: CONFIGSUCCESS [at glfs-operations.c+708 :<blockGetMetaInfo>]                                                       
10465 [2019-04-15 04:56:13.909204] INFO: info->nhosts: 1, cleanupsuccess: 0, info_new->list[i]->status[CONFIGSUCCESS] [at block_svc_routines.c+1292 :<glusterBlockDeleteRemoteAsync>]
10466 [2019-04-15 04:56:13.909241] INFO: info->nhosts: 1, cleanupsuccess: 0 [at block_svc_routines.c+1302 :<glusterBlockDeleteRemoteAsync>]
10467 [2019-04-15 04:56:13.909267] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2971 :<glusterBlockCleanUp>]
10468 [2019-04-15 04:56:13.909282] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4687 :<block_delete_cli_1_svc_st>]

But reading the metafile from the mountpoint, the block has already been successfully deleted:
# cat /mnt/block-meta/block-volume
VOLUME: hosting-volume
GBID: 2092dee3-a5d3-4823-ae6e-84bb9cae8bbf
HA: 1
ENTRYCREATE: INPROGRESS
PRIOPATH: 192.168.195.164
SIZE: 1048576
RINGBUFFER: 0
ENTRYCREATE: SUCCESS
192.168.195.164: CONFIGINPROGRESS
192.168.195.164: CONFIGSUCCESS
192.168.195.164: CLEANUPINPROGRESS
192.168.195.164: CLEANUPSUCCESS

Other useful information

@lxbsz
Collaborator Author

lxbsz commented Apr 15, 2019

@amarts @pkalever

It seems glfs_fsync and glfs_fdatasync do not help here either.

The issue is:

pthread_create(&tid, NULL, glusterBlockDeleteRemote, args);   // glusterBlockDeleteRemote() updates the metadata file in block-meta/metafile

pthread_join(tid, NULL);

blockGetMetaInfo();   // reads block-meta/metafile back, but sometimes it still gets the old data

I have also tried calling glfs_fsync/glfs_fdatasync in glusterBlockDeleteRemote(), but it still didn't work.
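
For reference, a minimal standalone sketch of this write-then-reread pattern, using two glfs handles to stand in for the two gluster-blockd client connections. This is a hypothetical repro, not the gluster-block code itself; the volume name, host, metafile path and the pre-4.0 single-argument glfs_fsync() signature are assumptions, and error handling is mostly omitted.

/* Writer updates the metafile through one glfs handle; reader re-reads
 * it through a second handle and may still see the old contents
 * because of client-side caching. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

static glfs_t *connect_client(void)
{
        glfs_t *fs = glfs_new("hosting-volume");     /* assumed volume name */
        if (!fs)
                return NULL;
        glfs_set_volfile_server(fs, "tcp", "192.168.195.164", 24007);
        if (glfs_init(fs) < 0) {
                glfs_fini(fs);
                return NULL;
        }
        return fs;
}

int main(void)
{
        glfs_t *writer = connect_client();
        glfs_t *reader = connect_client();
        if (!writer || !reader)
                return 1;

        /* "Remote" side: append the new status line and flush it. */
        glfs_fd_t *wfd = glfs_open(writer, "/block-meta/block-volume", O_WRONLY | O_APPEND);
        const char *line = "192.168.195.164: CLEANUPSUCCESS\n";
        glfs_write(wfd, line, strlen(line), 0);
        glfs_fsync(wfd);        /* flushes the write, but does not invalidate the reader's cache */
        glfs_close(wfd);

        /* "Local" side: re-read the metafile; it may still return old data. */
        char buf[4096] = "";
        glfs_fd_t *rfd = glfs_open(reader, "/block-meta/block-volume", O_RDONLY);
        glfs_read(rfd, buf, sizeof(buf) - 1, 0);
        glfs_close(rfd);

        if (!strstr(buf, "CLEANUPSUCCESS"))
                fprintf(stderr, "stale read: CLEANUPSUCCESS not visible yet\n");

        glfs_fini(writer);
        glfs_fini(reader);
        return 0;
}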

Any ideas?

lxbsz added a commit to lxbsz/gluster-block that referenced this issue Apr 16, 2019
For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible in time, so
even though the deletions all succeed, the command still returns
failure.

Checking the ->exit status is enough here.

For the case:
* When gluster-block delete was executed for the first time on this
  node, the deletion was successful, but before it could send the
  response, gluster-blockd died (the addr meta status would be
  CLEANUPFAIL)

even in this case we can check the ->exit status instead of reading
it from the metafile.

The deletion failure has also been seen in customer cases.

Fixes: gluster#204
Signed-off-by: Xiubo Li <xiubli@redhat.com>
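
In outline, the check this commit describes amounts to something like the sketch below; the struct, field and function names are invented stand-ins for the per-host results gluster-block already collects from the async delete, not the real gluster-block types.

#include <stddef.h>

/* Hypothetical per-host delete result, as reported in the RPC reply. */
struct host_delete_result {
        char *addr;     /* remote host that handled the delete */
        int   exit;     /* exit status from its reply */
};

/* Decide success from the collected exit statuses instead of
 * re-reading the metafile, which may still be stale. */
static int all_deletes_succeeded(struct host_delete_result *res, size_t nhosts)
{
        for (size_t i = 0; i < nhosts; i++) {
                if (res[i].exit != 0)
                        return 0;       /* at least one remote cleanup failed */
        }
        return 1;
}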
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

For modify I can also hit the same issue within a single client running multiple processes, in this case gluster-blockd and tcmu-runner.

We call glfs_ftruncate in gluster-blockd and then check the size in tcmu-runner using glfs_lstat, which returns the old size; but from the mountpoint's ls command we can see that the size has already been updated to the new one.
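
A rough sketch of this truncate-then-stat pattern, reusing connect_client() from the earlier sketch: one handle stands in for gluster-blockd resizing the backing file, the other for tcmu-runner checking its size. The /block-store/<GBID> path layout and the pre-4.0 two-argument glfs_ftruncate() signature are assumptions, and this is not the real daemons' code.

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

static void modify_race_sketch(glfs_t *blockd, glfs_t *runner)
{
        const char *path = "/block-store/e1c70c46-8c60-433d-8cea-05b1e8e77694";

        /* "gluster-blockd": grow the backing file to 2 MiB. */
        glfs_fd_t *fd = glfs_open(blockd, path, O_WRONLY);
        glfs_ftruncate(fd, 2 * 1024 * 1024);
        glfs_close(fd);

        /* "tcmu-runner": stat the same file through its own handle;
         * st_size may still report the old size. */
        struct stat st;
        glfs_lstat(runner, path, &st);
        printf("tcmu-runner sees size: %lld\n", (long long)st.st_size);
}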

lxbsz added a commit to lxbsz/tcmu-runner that referenced this issue Apr 16, 2019
For 'gluster-block modify' we can hit an issue within a single
client running multiple processes, in this case gluster-blockd
and tcmu-runner.

We call glfs_ftruncate in gluster-blockd to resize the volume file
and then check the size in tcmu-runner using glfs_lstat, but we
get the old size, which hasn't been updated yet.

However, from the mountpoint's ls command we can see that the size
has already been updated to the new one.

Here we wait at most 5 seconds, retrying 5 times, to make sure the
cache has been flushed to the volume.

Fixes: gluster/gluster-block#204
Signed-off-by: Xiubo Li <xiubli@redhat.com>
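
The retry this commit message describes could look roughly like the following; it is an illustrative sketch rather than the actual tcmu-runner patch, and wait_for_new_size() is an invented helper name.

#include <unistd.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

/* Poll glfs_lstat() until the expected new size is visible, retrying
 * 5 times with a 1-second pause, i.e. roughly 5 seconds at most. */
static int wait_for_new_size(glfs_t *fs, const char *path, off_t expected)
{
        struct stat st;

        for (int attempt = 0; attempt < 5; attempt++) {
                if (glfs_lstat(fs, path, &st) == 0 && st.st_size == expected)
                        return 0;       /* the cache has caught up with the truncate */
                sleep(1);               /* give the client cache time to refresh */
        }
        return -1;                      /* still seeing the stale size */
}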
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Fixed this in tcmu-runner: open-iscsi/tcmu-runner#546.

lxbsz changed the title from "Failed to delete the blocks" to "Failed to delete the blocks because the data couldn't be flushed in time" on Jul 16, 2019