Failed to delete the blocks because the data couldn't be flushed in time #204

Closed

lxbsz opened this issue Apr 15, 2019 · 3 comments · May be fixed by open-iscsi/tcmu-runner#546
Comments

@lxbsz
Collaborator

lxbsz commented Apr 15, 2019

Kind of issue

Bug

Observed behavior

When deleting the block volume, the command fails with logs like:

# gluster-block delete hosting-volume/block-volume
SUCCESSFUL ON:   192.168.195.164
RESULT: FAIL

Expected/desired behavior

The delete should succeed.

Details on how to reproduce (minimal and precise)

Create and then delete blocks repeatedly in a bash script.

Logs and Information about the environment:

10467 [2019-04-15 04:56:13.909267] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2971 :<glusterBlockCleanUp>]
10468 [2019-04-15 04:56:13.909282] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4687 :<block_delete_cli_1_svc_st>]

By adding more logs manually, we can see that, judging from the metafile, the delete appears not to have succeeded:

10454 [2019-04-15 04:56:13.906077] INFO: command exit code, 0 [at block_svc_routines.c+4761 :<block_delete_1_svc_st>]
10455 [2019-04-15 04:56:13.908336] INFO: VOLUME: hosting-volume [at glfs-operations.c+708 :<blockGetMetaInfo>]
10456 [2019-04-15 04:56:13.908423] INFO: GBID: e1c70c46-8c60-433d-8cea-05b1e8e77694 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10457 [2019-04-15 04:56:13.908490] INFO: HA: 1 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10458 [2019-04-15 04:56:13.908556] INFO: ENTRYCREATE: INPROGRESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10459 [2019-04-15 04:56:13.908618] INFO: PRIOPATH: 192.168.195.164 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10460 [2019-04-15 04:56:13.908737] INFO: SIZE: 1048576 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10461 [2019-04-15 04:56:13.908821] INFO: RINGBUFFER: 32 [at glfs-operations.c+708 :<blockGetMetaInfo>]
10462 [2019-04-15 04:56:13.908889] INFO: ENTRYCREATE: SUCCESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10463 [2019-04-15 04:56:13.908956] INFO: 192.168.195.164: CONFIGINPROGRESS [at glfs-operations.c+708 :<blockGetMetaInfo>]
10464 [2019-04-15 04:56:13.909021] INFO: 192.168.195.164: CONFIGSUCCESS [at glfs-operations.c+708 :<blockGetMetaInfo>]                                                       
10465 [2019-04-15 04:56:13.909204] INFO: info->nhosts: 1, cleanupsuccess: 0, info_new->list[i]->status[CONFIGSUCCESS] [at block_svc_routines.c+1292 :<glusterBlockDeleteRemoteAsync>]
10466 [2019-04-15 04:56:13.909241] INFO: info->nhosts: 1, cleanupsuccess: 0 [at block_svc_routines.c+1302 :<glusterBlockDeleteRemoteAsync>]
10467 [2019-04-15 04:56:13.909267] WARNING: glusterBlockDeleteRemoteAsync: return -1 failed in remote async delete for block block-volume on volume hosting-volume [at block_svc_routines.c+2971 :<glusterBlockCleanUp>]
10468 [2019-04-15 04:56:13.909282] WARNING: glusterBlockCleanUp: return -1 on block block-volume for volume hosting-volume [at block_svc_routines.c+4687 :<block_delete_cli_1_svc_st>]

But reading the metafile from the mountpoint, the block has already been successfully deleted:
# cat /mnt/block-meta/block-volume
VOLUME: hosting-volume
GBID: 2092dee3-a5d3-4823-ae6e-84bb9cae8bbf
HA: 1
ENTRYCREATE: INPROGRESS
PRIOPATH: 192.168.195.164
SIZE: 1048576
RINGBUFFER: 0
ENTRYCREATE: SUCCESS
192.168.195.164: CONFIGINPROGRESS
192.168.195.164: CONFIGSUCCESS
192.168.195.164: CLEANUPINPROGRESS
192.168.195.164: CLEANUPSUCCESS

Other useful information

@lxbsz
Collaborator Author

lxbsz commented Apr 15, 2019

@amarts @pkalever

It seems glfs_fsync and glfs_fdatasync do not help here either.

The issue is:

pthread_create(&tid, NULL, glusterBlockDeleteRemote, args);   // glusterBlockDeleteRemote() updates the metadata file in block-meta/metafile

pthread_join(tid, NULL);

blockGetMetaInfo();   // reads block-meta/metafile back, but sometimes it still gets the old data

I have also tried calling glfs_fsync/glfs_fdatasync in glusterBlockDeleteRemote(), but it still didn't work.
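
For reference, a minimal standalone sketch of this write-then-reread pattern, using two glfs handles to stand in for the two gluster-blockd client connections. This is a hypothetical repro, not the gluster-block code itself; the volume name, host, metafile path and the pre-4.0 single-argument glfs_fsync() signature are assumptions, and error handling is mostly omitted.

/* Writer updates the metafile through one glfs handle; reader re-reads
 * it through a second handle and may still see the old contents
 * because of client-side caching. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

static glfs_t *connect_client(void)
{
        glfs_t *fs = glfs_new("hosting-volume");     /* assumed volume name */
        if (!fs)
                return NULL;
        glfs_set_volfile_server(fs, "tcp", "192.168.195.164", 24007);
        if (glfs_init(fs) < 0) {
                glfs_fini(fs);
                return NULL;
        }
        return fs;
}

int main(void)
{
        glfs_t *writer = connect_client();
        glfs_t *reader = connect_client();
        if (!writer || !reader)
                return 1;

        /* "Remote" side: append the new status line and flush it. */
        glfs_fd_t *wfd = glfs_open(writer, "/block-meta/block-volume", O_WRONLY | O_APPEND);
        const char *line = "192.168.195.164: CLEANUPSUCCESS\n";
        glfs_write(wfd, line, strlen(line), 0);
        glfs_fsync(wfd);        /* flushes the write, but does not invalidate the reader's cache */
        glfs_close(wfd);

        /* "Local" side: re-read the metafile; it may still return old data. */
        char buf[4096] = "";
        glfs_fd_t *rfd = glfs_open(reader, "/block-meta/block-volume", O_RDONLY);
        glfs_read(rfd, buf, sizeof(buf) - 1, 0);
        glfs_close(rfd);

        if (!strstr(buf, "CLEANUPSUCCESS"))
                fprintf(stderr, "stale read: CLEANUPSUCCESS not visible yet\n");

        glfs_fini(writer);
        glfs_fini(reader);
        return 0;
}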

Any ideas?

lxbsz added a commit to lxbsz/gluster-block that referenced this issue Apr 16, 2019
For some reason, after updating the metafile and then reading it back
immediately, the new contents are sometimes not visible in time, so
even though the deletions all succeed, the command still returns
failure.

Checking the ->exit status is enough here.

For the case:
* When gluster-block delete was executed for the first time on this
  node, the deletion was successful, but before it could send the
  response, gluster-blockd died (the addr meta status would be
  CLEANUPFAIL)

even in this case we can check the ->exit status instead of reading
it from the metafile.

The deletion failure has also been seen in customer cases.

Fixes: gluster#204
Signed-off-by: Xiubo Li <xiubli@redhat.com>
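
In outline, the check this commit describes amounts to something like the sketch below; the struct, field and function names are invented stand-ins for the per-host results gluster-block already collects from the async delete, not the real gluster-block types.

#include <stddef.h>

/* Hypothetical per-host delete result, as reported in the RPC reply. */
struct host_delete_result {
        char *addr;     /* remote host that handled the delete */
        int   exit;     /* exit status from its reply */
};

/* Decide success from the collected exit statuses instead of
 * re-reading the metafile, which may still be stale. */
static int all_deletes_succeeded(struct host_delete_result *res, size_t nhosts)
{
        for (size_t i = 0; i < nhosts; i++) {
                if (res[i].exit != 0)
                        return 0;       /* at least one remote cleanup failed */
        }
        return 1;
}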
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

For modify I can also hit the same issue within a single client running multiple processes, in this case gluster-blockd and tcmu-runner.

We call glfs_ftruncate in gluster-blockd and then check the size in tcmu-runner using glfs_lstat, which returns the old size; but from the mountpoint's ls command we can see that the size has already been updated to the new one.
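
A rough sketch of this truncate-then-stat pattern, reusing connect_client() from the earlier sketch: one handle stands in for gluster-blockd resizing the backing file, the other for tcmu-runner checking its size. The /block-store/<GBID> path layout and the pre-4.0 two-argument glfs_ftruncate() signature are assumptions, and this is not the real daemons' code.

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

static void modify_race_sketch(glfs_t *blockd, glfs_t *runner)
{
        const char *path = "/block-store/e1c70c46-8c60-433d-8cea-05b1e8e77694";

        /* "gluster-blockd": grow the backing file to 2 MiB. */
        glfs_fd_t *fd = glfs_open(blockd, path, O_WRONLY);
        glfs_ftruncate(fd, 2 * 1024 * 1024);
        glfs_close(fd);

        /* "tcmu-runner": stat the same file through its own handle;
         * st_size may still report the old size. */
        struct stat st;
        glfs_lstat(runner, path, &st);
        printf("tcmu-runner sees size: %lld\n", (long long)st.st_size);
}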

lxbsz added a commit to lxbsz/tcmu-runner that referenced this issue Apr 16, 2019
For 'gluster-block modify' we can hit an issue within a single
client running multiple processes, in this case gluster-blockd
and tcmu-runner.

We call glfs_ftruncate in gluster-blockd to resize the volume file
and then check the size in tcmu-runner using glfs_lstat, but we
get the old size, which hasn't been updated yet.

However, from the mountpoint's ls command we can see that the size
has already been updated to the new one.

Here we wait at most 5 seconds, retrying 5 times, to make sure the
cache has been flushed to the volume.

Fixes: gluster/gluster-block#204
Signed-off-by: Xiubo Li <xiubli@redhat.com>
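
The retry this commit message describes could look roughly like the following; it is an illustrative sketch rather than the actual tcmu-runner patch, and wait_for_new_size() is an invented helper name.

#include <unistd.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

/* Poll glfs_lstat() until the expected new size is visible, retrying
 * 5 times with a 1-second pause, i.e. roughly 5 seconds at most. */
static int wait_for_new_size(glfs_t *fs, const char *path, off_t expected)
{
        struct stat st;

        for (int attempt = 0; attempt < 5; attempt++) {
                if (glfs_lstat(fs, path, &st) == 0 && st.st_size == expected)
                        return 0;       /* the cache has caught up with the truncate */
                sleep(1);               /* give the client cache time to refresh */
        }
        return -1;                      /* still seeing the stale size */
}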
@lxbsz
Collaborator Author

lxbsz commented Apr 16, 2019

Fixed this in tcmu-runner: open-iscsi/tcmu-runner#546.

lxbsz changed the title from "Failed to delete the blocks" to "Failed to delete the blocks because the data couldn't be flushed in time" on Jul 16, 2019