
glusterfsd 10.4 core dump in __gf_free - potentially related to cache invalidation #4255

Open
agronaught opened this issue Nov 2, 2023 · 2 comments · Fixed by #4256

@agronaught

agronaught commented Nov 2, 2023

Description of problem:
We are seeing occasional core dumps, with a brick going offline, in a replica-3 volume. This is potentially load related; if so, the trigger would be a very short, sharp load spike.

This was previously occurring weekly while bitrot checking was enabled; since disabling bitrot checks, the crash rate has dropped to roughly monthly (anecdotally). I have not yet managed to reproduce this error outside the production environment.

This is the first crash since disabling bitrot checking, and the first time we've had a core file generated.

potentially related to: #4241

The exact command to reproduce the issue:
I have not yet managed to reproduce this error outside the production environment.

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

Volume Name: srg-data
Type: Replicate
Volume ID: a50cf1b3-869e-4af3-b0f3-452fef6c4f48
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: labhv001.uk.makoglobal.com:/export/gluster/srg/data/brick1
Brick2: labhv002.uk.makoglobal.com:/export/gluster/srg/data/brick1
Brick3: labhv003.uk.makoglobal.com:/export/gluster/srg/data/brick1
Options Reconfigured:
auth.allow: 10.39.*.*
performance.readdir-ahead: off
server.event-threads: 10
server.root-squash: off
performance.nl-cache-timeout: 30
performance.parallel-readdir: off
performance.md-cache-timeout: 30
performance.cache-invalidation: on
performance.stat-prefetch: on
performance.cache-size: 128MB
network.inode-lru-limit: 200000
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
diagnostics.client-sys-log-level: WARNING
diagnostics.client-log-level: WARNING
diagnostics.brick-sys-log-level: INFO
diagnostics.brick-log-level: INFO
cluster.shd-wait-qlength: 4192
cluster.shd-max-threads: 4
client.event-threads: 10
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

- The output of the gluster volume status command:

Status of volume: srg-data
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick omxhv001.eu.makoglobal.com:/export/gluster/srg/data/brick1  57471     0          N       -
Brick omxhv002.eu.makoglobal.com:/export/gluster/srg/data/brick1  54143     0          Y       22079
Brick omxhv003.eu.makoglobal.com:/export/gluster/srg/data/brick1  50211     0          Y       9886
Self-heal Daemon on localhost                                     N/A       N/A        Y       7620
Self-heal Daemon on omxhv002.eu.makoglobal.com                    N/A       N/A        Y       21590
Self-heal Daemon on omxhv003.eu.makoglobal.com                    N/A       N/A        Y       9236

Task Status of Volume srg-data
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

gluster volume heal srg-data full
Launching heal operation to perform full self heal on volume srg-data has been successful 
Use heal info commands to check status.


[omxhv001 ~]# gluster volume heal srg-data info
Brick omxhv001.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

Brick omxhv002.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

Brick omxhv003.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

- Provide logs present at the following locations on client and server nodes:
/var/log/glusterfs/

[2023-11-02 11:18:35.004610 +0000] I [socket.c:3801:socket_submit_outgoing_msg] 0-tcp.srg-data-server: not connected (priv->connected = -1)
[2023-11-02 11:18:35.004663 +0000] W [rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git

signal received: 11
time of crash: 
2023-11-02 11:18:35 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.4
/lib64/libglusterfs.so.0(+0x2b854)[0x7f1a36040854]
/lib64/libglusterfs.so.0(gf_print_trace+0x78d)[0x7f1a36048e6d]
/lib64/libc.so.6(+0x54df0)[0x7f1a35c54df0]
/lib64/libglusterfs.so.0(__gf_free+0x69)[0x7f1a36065119]
/lib64/libgfrpc.so.0(rpc_transport_unref+0x9e)[0x7f1a35fe3fce]
/usr/lib64/glusterfs/10.4/xlator/protocol/server.so(+0xbd25)[0x7f1a3034cd25]
/usr/lib64/glusterfs/10.4/xlator/protocol/server.so(+0xc6d4)[0x7f1a3034d6d4]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/debug/io-stats.so(+0x1a418)[0x7f1a3041f418]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/quota.so(+0x12f62)[0x7f1a3044cf62]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/index.so(+0xaee5)[0x7f1a3046fee5]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/barrier.so(+0x7ce8)[0x7f1a30481ce8]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/performance/io-threads.so(+0x7921)[0x7f1a304be921]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0xe17f)[0x7f1a304d717f]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0xef96)[0x7f1a304d7f96]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x1380c)[0x7f1a304dc80c]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x2bea)[0x7f1a304cbbea]
/usr/lib64/glusterfs/10.4/xlator/features/leases.so(+0x2cdb)[0x7f1a304ebcdb]
/usr/lib64/glusterfs/10.4/xlator/features/locks.so(+0xd69d)[0x7f1a3051e69d]
/lib64/libglusterfs.so.0(default_writev_cbk+0x12b)[0x7f1a360b2beb]
/usr/lib64/glusterfs/10.4/xlator/features/changelog.so
/lib64/libglusterfs.so.0(default_writev+0xe6)[0x7f1a360bfe26]
/usr/lib64/glusterfs/10.4/xlator/features/changelog.so(+0x111af)[0x7f1a305971af]
/usr/lib64/glusterfs/10.4/xlator/features/bitrot-stub.so(+0xce74)[0x7f1a30574e74]
/lib64/libglusterfs.so.0(default_writev+0xe6)[0x7f1a360bfe26]
/usr/lib64/glusterfs/10.4/xlator/features/locks.so(+0x133e1)[0x7f1a305243e1]
/usr/lib64/glusterfs/10.4/xlator/features/worm.so(+0x5e9d)[0x7f1a30507e9d]
/usr/lib64/glusterfs/10.4/xlator/features/read-only.so(+0x5096)[0x7f1a3522c096]
/usr/lib64/glusterfs/10.4/xlator/features/leases.so(+0x632a)[0x7f1a304ef32a]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x7784)[0x7f1a304d0784]
/lib64/libglusterfs.so.0(default_writev_resume+0x203)[0x7f1a360bb533]
/lib64/libglusterfs.so.0(call_resume_wind+0x668)[0x7f1a3604d318]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f1a36066de5]
/usr/lib64/glusterfs/10.4/xlator/performance/io-threads.so(+0x6768)[0x7f1a304bd768]
/lib64/libc.so.6(+0x9f802)[0x7f1a35c9f802]
/lib64/libc.so.6(+0x3f450)[0x7f1a35c3f450]
(+0x8719)[0x7f1a3058e719]

- Is there any crash? Provide the backtrace and coredump:

gdb_thread_full.txt.gz
gdb_bt_full.txt.gz
core file is large (>50MB compressed)
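
For reference, a minimal sketch of how the attached backtraces can be regenerated from the core with gdb (the core path below is an assumption; adjust it to wherever the core was written on the affected node):

# assumed paths; substitute the actual core file location
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "thread apply all bt full" > gdb_thread_full.txt
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "bt full" > gdb_bt_full.txt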

Additional info:

We will be disabling performance.cache-invalidation and features.cache-invalidation on this system, as they look to be (potentially) related.
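
For reference, the commands we plan to run (volume name as shown in the volume info above; standard gluster volume set syntax):

gluster volume set srg-data performance.cache-invalidation off
gluster volume set srg-data features.cache-invalidation off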

- The operating system / glusterfs version:
CentOS 9, kernel 5.14.0-283.el9.x86_64
Gluster 10.4-1
brick FS: ZFS 2.1.9

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

@mohit84
Contributor

mohit84 commented Nov 3, 2023

@agronaught Thanks for sharing the stack trace of the core to analyze the issue. The brick process is crashing because it cannot handle an upcall event while a client disconnect is in progress. For the time being you can disable upcall notification to avoid the crash; I will send a patch to fix it.
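
To illustrate the shape of the fix (a minimal sketch only, not the actual GlusterFS patch; the type and function names below are hypothetical), the idea is to check whether a disconnect is already being processed for the client before dispatching the upcall notification, so the notification path cannot race with the transport teardown seen in the backtrace:

/* Hypothetical sketch: skip the upcall notification when the client's
 * disconnect is already in progress. None of these names are real
 * GlusterFS symbols. */

#include <pthread.h>
#include <stdbool.h>

typedef struct client {
    pthread_mutex_t lock;
    bool disconnecting; /* set when a disconnect starts being processed */
} client_t;

typedef struct upcall_event {
    int type;
} upcall_event_t;

/* Stand-in for the RPC submission that crashed once the transport had
 * already been torn down by the disconnect path. */
static int send_upcall_rpc(client_t *client, upcall_event_t *event)
{
    (void)client;
    (void)event;
    return 0;
}

int notify_upcall(client_t *client, upcall_event_t *event)
{
    bool skip;

    /* Read the client state under its lock: if a disconnect is being
     * processed for this client, drop the notification instead of
     * racing with the teardown. */
    pthread_mutex_lock(&client->lock);
    skip = client->disconnecting;
    pthread_mutex_unlock(&client->lock);

    if (skip)
        return 0; /* client is going away; nothing to notify */

    return send_upcall_rpc(client, event);
}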

mohit84 added a commit to mohit84/glusterfs that referenced this issue Nov 3, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

Fixes: gluster#4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@agronaught
Author

Thank you very much for this one. That confirms a suspicion and provides a solution.

Cheers.

mohit84 added a commit that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 reopened this Nov 6, 2023
mohit84 added a commit to mohit84/glusterfs that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: gluster#4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release gluster#4256)

Fixes: gluster#4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Shwetha-Acharya pushed a commit that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: #4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release #4256)

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 added a commit that referenced this issue Nov 10, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: #4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release #4256)

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>