[bug:1618932] dht-selfheal.c: Directory selfheal failed #957
Comments
Time: 20180818T13:00:04 No lock was found when doing the inode unlock; it seems the unlock request was from DHT self-heal. |
Time: 20180818T13:12:37 There were many "has not responded in the last 120 seconds, disconnecting." messages in it. |
Time: 20180820T13:38:32 I have triggered state dumps; will attach them later. |
Time: 20180820T13:42:04 Another offline cluster, upgraded from 3.7.6 to 3.10.12: "gluster volume heal info" hangs after a stress test. These are the state dump files. |
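For context, statedumps of this kind are normally requested per volume from one of the server nodes, along the lines of "gluster volume statedump vol0"; the dump files land under /var/run/gluster/ by default. The volume name vol0 is taken from the volume info at the end of this report and may differ on other setups.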
Time: 20180914T22:46:12 It may lead to blocking of client.event-threads. The situation was described in https://lists.gluster.org/pipermail/gluster-users/2018-September/034871.html. The suspected commits:
commit 086f1d0
commit 94faf8c
|
Time: 20180920T07:20:25 If it only happens with heal info, would it be possible for you to help find the root cause? We need to check if there is any deadlock in the heal-info process. Pranith |
Time: 20180920T12:26:27 Mounts via FUSE were instant. Writes through gfapi are OK most of the time. It turned out that the "heal info" problem was not related to the I/O error problem. Our heals are all normal for now (after reverting the two commits I pointed out earlier). |
Time: 20180920T12:39:59
Yes, if there are future patches, I'm happy to test them out. |
Time: 20180920T12:49:34 Using v3.12.11, with the log level changed to trace and some additional debug logging added. |
Time: 20180920T12:50:41 With version 3.12.1, a successful run. |
Time: 20180920T12:59:43
Ignore this; I checked the log and it was a bug not related to Gluster. |
Time: 20180920T14:00:45 Commit 94faf8c added two events to fire, but EVENT_AFR_SUBVOL_UP/EVENT_AFR_SUBVOLS_DOWN are not processed anywhere in the code. Commit 086f1d0 removed some variable initialization code, which seems to be a regression. |
Time: 20180926T22:48:27 New sample of the gfapi log with version 3.12.14. An I/O error happened while creating a file. |
Time: 20181029T11:29:28 Can you set client-log-level to DEBUG and send us the logs once you hit this? Or better yet, is there a test program I can use to try this out? |
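For reference, the client log level is usually raised with "gluster volume set vol0 diagnostics.client-log-level DEBUG" (or TRACE) and reset afterwards with "gluster volume reset vol0 diagnostics.client-log-level", since DEBUG/TRACE logs grow quickly; the volume name vol0 is assumed from the volume info at the end of this report.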
Time: 20181029T12:20:23
Just noticed that trace logs are already provided. I will take a look and get back. |
Time: 20191104T06:30:37
The logs indicate that the client version is 3.10. This could be because of https://bugzilla.redhat.com/show_bug.cgi?id=1455104 which was fixed in release 3.12. Do you still see the I/O errors with 3.12 or later? |
Time: 20191104T07:50:00
I do not see IO errors in the gfapi log. Please provide debug/trace logs when the issue is seen in 3.12 |
Thank you for your contributions. |
Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it. |
URL: https://bugzilla.redhat.com/1618932
Creator: frostyplanet at gmail
Time: 20180818T12:49:41
Created attachment 1476762
gfapi log
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
There are multiple applications using gfapi, concurrently creating files in the same directory.
(e51fd83622674cc9) and (e21ea6832d2b13d0) are logs from two different application processes.
Application log (timezone is GMT+8):
2018-08-18 19:35:03,703 DEBUG -31021968- writing to file cluster=4 FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88 (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:03,734 DEBUG -32369552- writing to file cluster=4 FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88 (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:03,786 DEBUG -31022448- Create new directory [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m] on cluster [4] ((unknown file): 0)(e51fd83622674cc9)
2018-08-18 19:35:03,795 CRITICAL -31021968- Failed to open cluster [4] object [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88] with mode [w]: [[Errno 5] Input/output error] (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:03,903 DEBUG -32366672- Directory [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m] exists on cluster [4] ((unknown file): 0)(e21ea6832d2b13d0)
2018-08-18 19:35:03,945 DEBUG -32369552- Open cluster [4] file [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88] with mode [w] (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:04,127 DEBUG -31021968- Open cluster [4] file [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88] with mode [w] (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:04,391 INFO -32369552- Rename file: cluster=4 src=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88 dst=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001 (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:04,485 INFO -31021968- Rename file: cluster=4 src=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88 dst=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004 (app:1461)(e51fd83622674cc9)
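The access pattern above (several gfapi clients concurrently creating files in a shared directory, writing them, then renaming them into place) can be approximated with a minimal libgfapi sketch like the one below. The volume name vol0 comes from this report, but the server host and all paths are placeholders, and error handling is reduced to the EIO case described under "Actual results".

/* Minimal libgfapi sketch of the reported access pattern: create a shared
 * directory, create a uniquely named temporary file inside it, write, close,
 * and rename it to its final name.  Build with: gcc repro.c -o repro -lgfapi */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    /* "vol0" matches the volume name in this report; host and port are placeholders. */
    glfs_t *fs = glfs_new("vol0");
    glfs_set_volfile_server(fs, "tcp", "gluster-server-1", 24007);
    glfs_set_logging(fs, "/tmp/gfapi-repro.log", 7); /* log level; higher is more verbose */
    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        return 1;
    }

    const char *dir = "/demo-dir";            /* directory shared by all writers */
    const char *tmp = "/demo-dir/0001_tmp";   /* per-writer temporary name */
    const char *dst = "/demo-dir/0001";       /* final name after rename */

    /* Whichever client gets here first creates the directory; the rest see EEXIST. */
    if (glfs_mkdir(fs, dir, 0755) != 0 && errno != EEXIST)
        perror("glfs_mkdir");

    /* This open-for-write is the step that intermittently failed with EIO in the report. */
    glfs_fd_t *fd = glfs_creat(fs, tmp, O_WRONLY | O_TRUNC, 0644);
    if (fd == NULL) {
        fprintf(stderr, "glfs_creat: %s\n", strerror(errno));
        glfs_fini(fs);
        return 1;
    }
    glfs_write(fd, "payload\n", 8, 0);
    glfs_close(fd);

    if (glfs_rename(fs, tmp, dst) != 0)
        perror("glfs_rename");

    glfs_fini(fs);
    return 0;
}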
Actual results:
An I/O error happened when creating a file; the operation succeeded after a retry.
A dht-selfheal failure is observed in the gfapi log, and an unmatched inode unlock request is reported from the brick.
Expected results:
Additional info:
"gluster volume status" output is all ok,
but runing "gluster volume heal vol0 info" blocks and no output
gluster volume info
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 18e1c05d-570a-4c97-aa91-ef984881c4f2
Status: Started
Snapshot Count: 0
Number of Bricks: 36 x 3 = 108
Transport-type: tcp
Options Reconfigured:
locks.trace: false
client.event-threads: 6
cluster.self-heal-daemon: enable
performance.write-behind: True
transport.keepalive: True
cluster.rebal-throttle: lazy
server.event-threads: 4
performance.io-cache: False
nfs.disable: True
cluster.quorum-type: auto
network.ping-timeout: 120
features.cache-invalidation: False
performance.read-ahead: False
performance.client-io-threads: True
cluster.server-quorum-type: none
performance.md-cache-timeout: 0
performance.readdir-ahead: True