
glusterfsd 10.4 core dump in __gf_free - potentially related to cache invalidation #4255

Open
agronaught opened this issue Nov 2, 2023 · 2 comments · Fixed by #4256

@agronaught

agronaught commented Nov 2, 2023

Description of problem:
We are seeing occasional core dumps, with a brick going offline, in a replica-3 volume. This is potentially load related; if so, the trigger would be a very short, sharp load spike.

This was previously occurring weekly while bitrot checking was enabled; since disabling bitrot checks, the crash rate has dropped to roughly monthly (anecdotally). I have not yet managed to reproduce this error outside the production environment.

This is the first crash since disabling bitrot checking, and the first time we've had a core file generated.

potentially related to: #4241

The exact command to reproduce the issue:
I have not yet managed to reproduce this error outside the production environment.

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

Volume Name: srg-data
Type: Replicate
Volume ID: a50cf1b3-869e-4af3-b0f3-452fef6c4f48
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: labhv001.uk.makoglobal.com:/export/gluster/srg/data/brick1
Brick2: labhv002.uk.makoglobal.com:/export/gluster/srg/data/brick1
Brick3: labhv003.uk.makoglobal.com:/export/gluster/srg/data/brick1
Options Reconfigured:
auth.allow: 10.39.*.*
performance.readdir-ahead: off
server.event-threads: 10
server.root-squash: off
performance.nl-cache-timeout: 30
performance.parallel-readdir: off
performance.md-cache-timeout: 30
performance.cache-invalidation: on
performance.stat-prefetch: on
performance.cache-size: 128MB
network.inode-lru-limit: 200000
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
diagnostics.client-sys-log-level: WARNING
diagnostics.client-log-level: WARNING
diagnostics.brick-sys-log-level: INFO
diagnostics.brick-log-level: INFO
cluster.shd-wait-qlength: 4192
cluster.shd-max-threads: 4
client.event-threads: 10
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

- The output of the gluster volume status command:

Status of volume: srg-data
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick omxhv001.eu.makoglobal.com:/export/gluster/srg/data/brick1  57471     0          N       -
Brick omxhv002.eu.makoglobal.com:/export/gluster/srg/data/brick1  54143     0          Y       22079
Brick omxhv003.eu.makoglobal.com:/export/gluster/srg/data/brick1  50211     0          Y       9886
Self-heal Daemon on localhost                                     N/A       N/A        Y       7620
Self-heal Daemon on omxhv002.eu.makoglobal.com                    N/A       N/A        Y       21590
Self-heal Daemon on omxhv003.eu.makoglobal.com                    N/A       N/A        Y       9236

Task Status of Volume srg-data
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

gluster volume heal srg-data full
Launching heal operation to perform full self heal on volume srg-data has been successful 
Use heal info commands to check status.


[omxhv001 ~]# gluster volume heal srg-data info
Brick omxhv001.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

Brick omxhv002.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

Brick omxhv003.eu.makoglobal.com:/export/gluster/srg/data/brick1
Status: Connected
Number of entries: 0

- Provide logs present at the following locations on client and server nodes:
/var/log/glusterfs/

[2023-11-02 11:18:35.004610 +0000] I [socket.c:3801:socket_submit_outgoing_msg] 0-tcp.srg-data-server: not connected (priv->connected = -1)
[2023-11-02 11:18:35.004663 +0000] W [rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git

signal received: 11
time of crash: 
2023-11-02 11:18:35 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.4
/lib64/libglusterfs.so.0(+0x2b854)[0x7f1a36040854]
/lib64/libglusterfs.so.0(gf_print_trace+0x78d)[0x7f1a36048e6d]
/lib64/libc.so.6(+0x54df0)[0x7f1a35c54df0]
/lib64/libglusterfs.so.0(__gf_free+0x69)[0x7f1a36065119]
/lib64/libgfrpc.so.0(rpc_transport_unref+0x9e)[0x7f1a35fe3fce]
/usr/lib64/glusterfs/10.4/xlator/protocol/server.so(+0xbd25)[0x7f1a3034cd25]
/usr/lib64/glusterfs/10.4/xlator/protocol/server.so(+0xc6d4)[0x7f1a3034d6d4]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/debug/io-stats.so(+0x1a418)[0x7f1a3041f418]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/quota.so(+0x12f62)[0x7f1a3044cf62]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/index.so(+0xaee5)[0x7f1a3046fee5]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/barrier.so(+0x7ce8)[0x7f1a30481ce8]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/performance/io-threads.so(+0x7921)[0x7f1a304be921]
/lib64/libglusterfs.so.0(xlator_notify+0x38)[0x7f1a360339a8]
/lib64/libglusterfs.so.0(default_notify+0x1ab)[0x7f1a360c6fdb]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0xe17f)[0x7f1a304d717f]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0xef96)[0x7f1a304d7f96]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x1380c)[0x7f1a304dc80c]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x2bea)[0x7f1a304cbbea]
/usr/lib64/glusterfs/10.4/xlator/features/leases.so(+0x2cdb)[0x7f1a304ebcdb]
/usr/lib64/glusterfs/10.4/xlator/features/locks.so(+0xd69d)[0x7f1a3051e69d]
/lib64/libglusterfs.so.0(default_writev_cbk+0x12b)[0x7f1a360b2beb]
/usr/lib64/glusterfs/10.4/xlator/features/changelog.so
/lib64/libglusterfs.so.0(default_writev+0xe6)[0x7f1a360bfe26]
/usr/lib64/glusterfs/10.4/xlator/features/changelog.so(+0x111af)[0x7f1a305971af]
/usr/lib64/glusterfs/10.4/xlator/features/bitrot-stub.so(+0xce74)[0x7f1a30574e74]
/lib64/libglusterfs.so.0(default_writev+0xe6)[0x7f1a360bfe26]
/usr/lib64/glusterfs/10.4/xlator/features/locks.so(+0x133e1)[0x7f1a305243e1]
/usr/lib64/glusterfs/10.4/xlator/features/worm.so(+0x5e9d)[0x7f1a30507e9d]
/usr/lib64/glusterfs/10.4/xlator/features/read-only.so(+0x5096)[0x7f1a3522c096]
/usr/lib64/glusterfs/10.4/xlator/features/leases.so(+0x632a)[0x7f1a304ef32a]
/usr/lib64/glusterfs/10.4/xlator/features/upcall.so(+0x7784)[0x7f1a304d0784]
/lib64/libglusterfs.so.0(default_writev_resume+0x203)[0x7f1a360bb533]
/lib64/libglusterfs.so.0(call_resume_wind+0x668)[0x7f1a3604d318]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f1a36066de5]
/usr/lib64/glusterfs/10.4/xlator/performance/io-threads.so(+0x6768)[0x7f1a304bd768]
/lib64/libc.so.6(+0x9f802)[0x7f1a35c9f802]
/lib64/libc.so.6(+0x3f450)[0x7f1a35c3f450]
(+0x8719)[0x7f1a3058e719]

- Is there any crash? Provide the backtrace and coredump:

gdb_thread_full.txt.gz
gdb_bt_full.txt.gz
core file is large (>50MB compressed)
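
For reference, a minimal sketch of how the attached backtraces can be regenerated from the core with gdb (the core path below is an assumption; adjust it to wherever the core was written on the affected node):

# assumed paths; substitute the actual core file location
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "thread apply all bt full" > gdb_thread_full.txt
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "bt full" > gdb_bt_full.txt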

Additional info:

We will be disabling performance.cache-invalidation and features.cache-invalidation on this system, as they look to be (potentially) related.
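
For reference, the commands we plan to run (volume name as shown in the volume info above; standard gluster volume set syntax):

gluster volume set srg-data performance.cache-invalidation off
gluster volume set srg-data features.cache-invalidation off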

- The operating system / glusterfs version:
CentOS 9, kernel 5.14.0-283.el9.x86_64
Gluster 10.4-1
brick FS: ZFS 2.1.9

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

@mohit84
Contributor

mohit84 commented Nov 3, 2023

@agronaught Thanks for sharing the stack trace of the core to analyze the issue. The brick process is crashing because it cannot handle an upcall event while a client disconnect is in progress. For the time being you can disable upcall notification to avoid the crash; I will send a patch to fix it.
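
To illustrate the shape of the fix (a minimal sketch only, not the actual GlusterFS patch; the type and function names below are hypothetical), the idea is to check whether a disconnect is already being processed for the client before dispatching the upcall notification, so the notification path cannot race with the transport teardown seen in the backtrace:

/* Hypothetical sketch: skip the upcall notification when the client's
 * disconnect is already in progress. None of these names are real
 * GlusterFS symbols. */

#include <pthread.h>
#include <stdbool.h>

typedef struct client {
    pthread_mutex_t lock;
    bool disconnecting; /* set when a disconnect starts being processed */
} client_t;

typedef struct upcall_event {
    int type;
} upcall_event_t;

/* Stand-in for the RPC submission that crashed once the transport had
 * already been torn down by the disconnect path. */
static int send_upcall_rpc(client_t *client, upcall_event_t *event)
{
    (void)client;
    (void)event;
    return 0;
}

int notify_upcall(client_t *client, upcall_event_t *event)
{
    bool skip;

    /* Read the client state under its lock: if a disconnect is being
     * processed for this client, drop the notification instead of
     * racing with the teardown. */
    pthread_mutex_lock(&client->lock);
    skip = client->disconnecting;
    pthread_mutex_unlock(&client->lock);

    if (skip)
        return 0; /* client is going away; nothing to notify */

    return send_upcall_rpc(client, event);
}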

mohit84 added a commit to mohit84/glusterfs that referenced this issue Nov 3, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

Fixes: gluster#4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@agronaught
Author

Thank you very much for this one. That confirms a suspicion and provides a solution.

Cheers.

mohit84 added a commit that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 reopened this Nov 6, 2023
mohit84 added a commit to mohit84/glusterfs that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: gluster#4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release gluster#4256)

Fixes: gluster#4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Shwetha-Acharya pushed a commit that referenced this issue Nov 6, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: #4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release #4256)

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 added a commit that referenced this issue Nov 10, 2023
A brick process may crash while it tries to send an upcall notification
to the client while a client disconnect is being processed.

Solution: Avoid upcall event notification to the client if a disconnect
is being processed for the same client.

> Fixes: #4255
> Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit b98d0d7)
> (Reviewed on upstream release #4256)

Fixes: #4255
Change-Id: I80478d7f4a038b04a10fb21a1290b4309e9fe4dd

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>