
[bug:1787463] Glusterd process is periodically crashing with a segmentation fault #1106

Closed
gluster-ant opened this issue Mar 17, 2020 · 7 comments
Labels: Migrated, Type:Bug, wontfix (Managed by stale[bot])

Comments

@gluster-ant
Collaborator

URL: https://bugzilla.redhat.com/1787463
Creator: awingerter at opentext
Time: 20200102T23:25:51

Description of problem: The glusterd process is periodically crashing with a segmentation fault. This happens occasionally on some of our nodes, and I've been unable to determine the cause.

Dec 18 18:13:53 ch1c7ocvgl01 systemd: glusterd.service: main process exited, code=killed, status=11/SEGV
Dec 18 19:02:49 ch1c7ocvgl01 systemd: glusterd.service: main process exited, code=killed, status=11/SEGV
Dec 19 18:24:15 ch1c7ocvgl01 systemd: glusterd.service: main process exited, code=killed, status=11/SEGV
Dec 21 05:45:39 ch1c7ocvgl01 systemd: glusterd.service: main process exited, code=killed, status=11/SEGV
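
An equivalent check against the systemd journal confirms the crash frequency per node (illustrative command only; adjust the --since date as needed):

# journalctl -u glusterd.service --since "2019-12-18" | grep "status=11/SEGV"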

Version-Release number of selected component (if applicable):

[root@ch1c7ocvgl01 ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

[root@ch1c7ocvgl01 /]# rpm -qa | grep gluster
glusterfs-libs-6.1-1.el7.x86_64
glusterfs-server-6.1-1.el7.x86_64
tendrl-gluster-integration-1.6.3-10.el7.noarch
centos-release-gluster6-1.0-1.el7.centos.noarch
python2-gluster-6.1-1.el7.x86_64
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-api-6.1-1.el7.x86_64
nfs-ganesha-gluster-2.8.2-1.el7.x86_64
glusterfs-client-xlators-6.1-1.el7.x86_64
glusterfs-cli-6.1-1.el7.x86_64
glusterfs-6.1-1.el7.x86_64
glusterfs-fuse-6.1-1.el7.x86_64
glusterfs-events-6.1-1.el7.x86_64

How reproducible:

Unable to reproduce at this time. Issue occurs periodically with an indeterminate cause.

Steps to Reproduce:
N/A

Actual results:
N/A

Expected results:

glusterd should not crash with a segmentation fault.

Additional info:

Several core dumps are available at the link below; they are too large to attach.

https://nextcloud.anthonywingerter.net/index.php/s/3n5sSE3SNxfyeyj

Please let me know what further info I can provide.

[root@ch1c7ocvgl01 ~]# gluster volume info

Volume Name: autosfx-prd
Type: Distributed-Replicate
Volume ID: 25e6b3a9-f339-4439-b41e-6084c7527320
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/autosfx/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/autosfx/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/autosfx/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/autosfx/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/autosfx/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/autosfx/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/autosfx/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/autosfx/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/autosfx/brick09 (arbiter)
Options Reconfigured:
nfs.disable: on
performance.client-io-threads: off
transport.address-family: inet
cluster.lookup-optimize: on
performance.stat-prefetch: on
server.event-threads: 16
client.event-threads: 16
performance.cache-invalidation: on
performance.read-ahead: on
storage.fips-mode-rchecksum: on
performance.cache-size: 6GB
features.ctime: on
cluster.self-heal-daemon: enable
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
cluster.rebalance-stats: on
cluster.readdir-optimize: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.heal-timeout: 500
cluster.quorum-type: auto
cluster.self-heal-window-size: 2
cluster.self-heal-readdir-size: 2KB
network.ping-timeout: 15
cluster.eager-lock: on
performance.io-thread-count: 16
cluster.shd-max-threads: 64
cluster.shd-wait-qlength: 4096
performance.write-behind-window-size: 8MB
cluster.enable-shared-storage: enable

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 50e7c3e8-adb9-427f-ae56-c327829a7d34
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl02.covisint.net:/var/lib/glusterd/ss_brick
Brick2: ch1c7ocvgl03.covisint.net:/var/lib/glusterd/ss_brick
Brick3: ch1c7ocvgl01.covisint.net:/var/lib/glusterd/ss_brick
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.enable-shared-storage: enable

Volume Name: hc-pstore-prd
Type: Distributed-Replicate
Volume ID: 1947247c-b3e0-4bd9-b808-011273e45195
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/hc-pstore-prd/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/hc-pstore-prd/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/hc-pstore-prd/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/hc-pstore-prd/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/hc-pstore-prd/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/hc-pstore-prd/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick09 (arbiter)
Options Reconfigured:
auth.allow: exlap1354.covisint.net,exlap1355.covisint.net
performance.write-behind-window-size: 8MB
cluster.shd-wait-qlength: 4096
cluster.shd-max-threads: 64
performance.io-thread-count: 16
cluster.eager-lock: on
network.ping-timeout: 15
cluster.self-heal-readdir-size: 2KB
cluster.self-heal-window-size: 2
cluster.quorum-type: auto
cluster.heal-timeout: 500
cluster.data-self-heal: on
cluster.metadata-self-heal: on
cluster.readdir-optimize: on
cluster.rebalance-stats: on
cluster.background-self-heal-count: 256
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.self-heal-daemon: enable
features.ctime: on
performance.cache-size: 2GB
storage.fips-mode-rchecksum: on
performance.read-ahead: on
performance.cache-invalidation: on
client.event-threads: 8
server.event-threads: 8
performance.stat-prefetch: on
cluster.lookup-optimize: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

Volume Name: plink-prd
Type: Distributed-Replicate
Volume ID: f146a391-c92e-4965-9026-09f16d2d1c53
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/plink/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/plink/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/plink/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/plink/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/plink/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/plink/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/plink/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/plink/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/plink/brick09 (arbiter)
Options Reconfigured:
nfs.disable: on
performance.client-io-threads: off
transport.address-family: inet
cluster.lookup-optimize: on
performance.stat-prefetch: on
server.event-threads: 16
client.event-threads: 16
performance.cache-invalidation: on
performance.read-ahead: on
storage.fips-mode-rchecksum: on
performance.cache-size: 3800MB
features.ctime: on
cluster.self-heal-daemon: enable
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
cluster.rebalance-stats: on
cluster.readdir-optimize: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.heal-timeout: 500
cluster.quorum-type: auto
cluster.self-heal-window-size: 2
cluster.self-heal-readdir-size: 2KB
network.ping-timeout: 15
cluster.eager-lock: on
performance.io-thread-count: 16
cluster.shd-max-threads: 64
cluster.shd-wait-qlength: 4096
performance.write-behind-window-size: 8MB
cluster.enable-shared-storage: enable

Volume Name: pstore-prd
Type: Distributed-Replicate
Volume ID: d77c45ef-19ca-4add-9dac-1bc401244395
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/pstore-prd/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/pstore-prd/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/pstore-prd/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/pstore-prd/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/pstore-prd/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/pstore-prd/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/pstore-prd/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/pstore-prd/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/pstore-prd/brick09 (arbiter)
Options Reconfigured:
cluster.min-free-disk: 1GB
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
performance.stat-prefetch: on
server.event-threads: 16
client.event-threads: 16
performance.cache-invalidation: on
performance.read-ahead: on
storage.fips-mode-rchecksum: on
performance.cache-size: 6GB
features.ctime: on
cluster.self-heal-daemon: enable
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
cluster.rebalance-stats: on
cluster.readdir-optimize: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.heal-timeout: 500
cluster.quorum-type: auto
cluster.self-heal-window-size: 2
cluster.self-heal-readdir-size: 2KB
network.ping-timeout: 15
cluster.eager-lock: on
performance.io-thread-count: 16
cluster.shd-max-threads: 64
cluster.shd-wait-qlength: 4096
performance.write-behind-window-size: 8MB
auth.allow: exlap779.covisint.net,exlap780.covisint.net
cluster.enable-shared-storage: enable

Volume Name: rvsshare-prd
Type: Distributed-Replicate
Volume ID: bee2d0f7-9215-4be8-9fc6-302fd568d5ed
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/rvsshare-prd/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/rvsshare-prd/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/rvsshare-prd/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/rvsshare-prd/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/rvsshare-prd/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/rvsshare-prd/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/rvsshare-prd/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/rvsshare-prd/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/rvsshare-prd/brick09 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.lookup-optimize: on
performance.stat-prefetch: on
server.event-threads: 16
client.event-threads: 16
performance.cache-invalidation: on
performance.read-ahead: on
storage.fips-mode-rchecksum: on
performance.cache-size: 6GB
features.ctime: off
cluster.self-heal-daemon: enable
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
diagnostics.brick-log-level: ERROR
diagnostics.client-log-level: ERROR
cluster.data-self-heal-algorithm: full
cluster.background-self-heal-count: 256
cluster.rebalance-stats: on
cluster.readdir-optimize: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.heal-timeout: 500
cluster.quorum-type: auto
cluster.self-heal-window-size: 2
cluster.self-heal-readdir-size: 2KB
network.ping-timeout: 15
cluster.eager-lock: on
performance.io-thread-count: 16
cluster.shd-max-threads: 64
cluster.shd-wait-qlength: 4096
performance.write-behind-window-size: 8MB
auth.allow: exlap825.covisint.net,exlap826.covisint.net
cluster.enable-shared-storage: enable

Volume Name: test
Type: Distributed-Replicate
Volume ID: 07c36821-382d-45bd-9f17-e7e48811d2a2
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: ch1c7ocvgl01:/covisint/gluster/test/brick01
Brick2: ch1c7ocvgl02:/covisint/gluster/test/brick02
Brick3: ch1c7ocvga11:/covisint/gluster/test/brick03 (arbiter)
Brick4: ch1c7ocvgl03:/covisint/gluster/test/brick04
Brick5: ch1c7ocvgl04:/covisint/gluster/test/brick05
Brick6: ch1c7ocvga11:/covisint/gluster/test/brick06 (arbiter)
Brick7: ch1c7ocvgl05:/covisint/gluster/test/brick07
Brick8: ch1c7ocvgl06:/covisint/gluster/test/brick08
Brick9: ch1c7ocvga11:/covisint/gluster/test/brick09 (arbiter)
Options Reconfigured:
performance.write-behind-window-size: 8MB
cluster.shd-wait-qlength: 4096
cluster.shd-max-threads: 64
performance.io-thread-count: 16
cluster.eager-lock: on
network.ping-timeout: 15
cluster.self-heal-readdir-size: 2KB
cluster.self-heal-window-size: 2
cluster.quorum-type: auto
cluster.heal-timeout: 500
cluster.data-self-heal: on
cluster.metadata-self-heal: on
cluster.readdir-optimize: on
cluster.rebalance-stats: on
cluster.background-self-heal-count: 256
cluster.data-self-heal-algorithm: full
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.self-heal-daemon: enable
performance.cache-size: 2GB
storage.fips-mode-rchecksum: on
performance.read-ahead: on
performance.cache-invalidation: on
client.event-threads: 16
server.event-threads: 16
performance.stat-prefetch: on
cluster.lookup-optimize: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

[root@ch1c7ocvgl01 ~]# gluster volume status
Status of volume: autosfx-prd
Gluster process TCP Port RDMA Port Online Pid

Brick ch1c7ocvgl01:/covisint/gluster/autosfx/brick01    49152  0    Y  8316
Brick ch1c7ocvgl02:/covisint/gluster/autosfx/brick02    49152  0    Y  8310
Brick ch1c7ocvga11:/covisint/gluster/autosfx/brick03    49152  0    Y  8688
Brick ch1c7ocvgl03:/covisint/gluster/autosfx/brick04    49152  0    Y  8388
Brick ch1c7ocvgl04:/covisint/gluster/autosfx/brick05    49152  0    Y  7705
Brick ch1c7ocvga11:/covisint/gluster/autosfx/brick06    49153  0    Y  8689
Brick ch1c7ocvgl05:/covisint/gluster/autosfx/brick07    49152  0    Y  8128
Brick ch1c7ocvgl06:/covisint/gluster/autosfx/brick08    49152  0    Y  7811
Brick ch1c7ocvga11:/covisint/gluster/autosfx/brick09    49154  0    Y  8690
Self-heal Daemon on localhost                           N/A    N/A  Y  15133
Self-heal Daemon on ch1c7ocvgl05.covisint.net           N/A    N/A  Y  13966
Self-heal Daemon on ch1c7ocvgl04.covisint.net           N/A    N/A  Y  25439
Self-heal Daemon on ch1c7ocvgl03.covisint.net           N/A    N/A  Y  27470
Self-heal Daemon on ch1c7ocvga11.covisint.net           N/A    N/A  Y  4772
Self-heal Daemon on ch1c7ocvgl02                        N/A    N/A  Y  30524
Self-heal Daemon on ch1c7ocvgl06.covisint.net           N/A    N/A  Y  10152

Task Status of Volume autosfx-prd

There are no active volume tasks

Status of volume: gluster_shared_storage
Gluster process TCP Port RDMA Port Online Pid

Brick ch1c7ocvgl02.covisint.net:/var/lib/glusterd/ss_brick    49153  0    Y  8319
Brick ch1c7ocvgl03.covisint.net:/var/lib/glusterd/ss_brick    49153  0    Y  8381
Brick ch1c7ocvgl01.covisint.net:/var/lib/glusterd/ss_brick    49153  0    Y  8332
Self-heal Daemon on localhost                                 N/A    N/A  Y  15133
Self-heal Daemon on ch1c7ocvgl05.covisint.net                 N/A    N/A  Y  13966
Self-heal Daemon on ch1c7ocvga11.covisint.net                 N/A    N/A  Y  4772
Self-heal Daemon on ch1c7ocvgl04.covisint.net                 N/A    N/A  Y  25439
Self-heal Daemon on ch1c7ocvgl03.covisint.net                 N/A    N/A  Y  27470
Self-heal Daemon on ch1c7ocvgl02                              N/A    N/A  Y  30524
Self-heal Daemon on ch1c7ocvgl06.covisint.net                 N/A    N/A  Y  10152

Task Status of Volume gluster_shared_storage

There are no active volume tasks

Status of volume: hc-pstore-prd
Gluster process TCP Port RDMA Port Online Pid

Brick ch1c7ocvgl01:/covisint/gluster/hc-pstore-prd/brick01    49156  0    Y  15244
Brick ch1c7ocvgl02:/covisint/gluster/hc-pstore-prd/brick02    49155  0    Y  30807
Brick ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick03    49155  0    Y  8755
Brick ch1c7ocvgl03:/covisint/gluster/hc-pstore-prd/brick04    49156  0    Y  14874
Brick ch1c7ocvgl04:/covisint/gluster/hc-pstore-prd/brick05    49154  0    Y  21306
Brick ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick06    49156  0    Y  8734
Brick ch1c7ocvgl05:/covisint/gluster/hc-pstore-prd/brick07    49156  0    Y  7865
Brick ch1c7ocvgl06:/covisint/gluster/hc-pstore-prd/brick08    49154  0    Y  5401
Brick ch1c7ocvga11:/covisint/gluster/hc-pstore-prd/brick09    49157  0    Y  8744
Self-heal Daemon on localhost                                 N/A    N/A  Y  15133
Self-heal Daemon on ch1c7ocvgl05.covisint.net                 N/A    N/A  Y  13966
Self-heal Daemon on ch1c7ocvgl03.covisint.net                 N/A    N/A  Y  27470
Self-heal Daemon on ch1c7ocvga11.covisint.net                 N/A    N/A  Y  4772
Self-heal Daemon on ch1c7ocvgl02                              N/A    N/A  Y  30524
Self-heal Daemon on ch1c7ocvgl04.covisint.net                 N/A    N/A  Y  25439
Self-heal Daemon on ch1c7ocvgl06.covisint.net                 N/A    N/A  Y  10152

Task Status of Volume hc-pstore-prd

There are no active volume tasks

Another transaction is in progress for plink-prd. Please try again after some time.

Status of volume: pstore-prd
Gluster process TCP Port RDMA Port Online Pid

Brick ch1c7ocvgl01:/covisint/gluster/pstore-prd/brick01    49155  0    Y  23221
Brick ch1c7ocvgl02:/covisint/gluster/pstore-prd/brick02    49156  0    Y  7888
Brick ch1c7ocvga11:/covisint/gluster/pstore-prd/brick03    49161  0    Y  8835
Brick ch1c7ocvgl03:/covisint/gluster/pstore-prd/brick04    49155  0    Y  18838
Brick ch1c7ocvgl04:/covisint/gluster/pstore-prd/brick05    49155  0    Y  18114
Brick ch1c7ocvga11:/covisint/gluster/pstore-prd/brick06    49162  0    Y  8848
Brick ch1c7ocvgl05:/covisint/gluster/pstore-prd/brick07    49155  0    Y  24013
Brick ch1c7ocvgl06:/covisint/gluster/pstore-prd/brick08    49155  0    Y  9192
Brick ch1c7ocvga11:/covisint/gluster/pstore-prd/brick09    49163  0    Y  8859
Self-heal Daemon on localhost                              N/A    N/A  Y  15133
Self-heal Daemon on ch1c7ocvga11.covisint.net              N/A    N/A  Y  4772
Self-heal Daemon on ch1c7ocvgl03.covisint.net              N/A    N/A  Y  27470
Self-heal Daemon on ch1c7ocvgl05.covisint.net              N/A    N/A  Y  13966
Self-heal Daemon on ch1c7ocvgl04.covisint.net              N/A    N/A  Y  25439
Self-heal Daemon on ch1c7ocvgl06.covisint.net              N/A    N/A  Y  10152
Self-heal Daemon on ch1c7ocvgl02                           N/A    N/A  Y  30524

Task Status of Volume pstore-prd

There are no active volume tasks

Another transaction is in progress for rvsshare-prd. Please try again after some time.

Status of volume: test
Gluster process TCP Port RDMA Port Online Pid

Brick ch1c7ocvgl01:/covisint/gluster/test/brick01    49158  0    Y  20468
Brick ch1c7ocvgl02:/covisint/gluster/test/brick02    49158  0    Y  30442
Brick ch1c7ocvga11:/covisint/gluster/test/brick03    49167  0    Y  8966
Brick ch1c7ocvgl03:/covisint/gluster/test/brick04    49158  0    Y  27364
Brick ch1c7ocvgl04:/covisint/gluster/test/brick05    49156  0    Y  19154
Brick ch1c7ocvga11:/covisint/gluster/test/brick06    49168  0    Y  8980
Brick ch1c7ocvgl05:/covisint/gluster/test/brick07    49157  0    Y  13820
Brick ch1c7ocvgl06:/covisint/gluster/test/brick08    49157  0    Y  10030
Brick ch1c7ocvga11:/covisint/gluster/test/brick09    49169  0    Y  9015
Self-heal Daemon on localhost                        N/A    N/A  Y  15133
Self-heal Daemon on ch1c7ocvgl03.covisint.net        N/A    N/A  Y  27470
Self-heal Daemon on ch1c7ocvgl05.covisint.net        N/A    N/A  Y  13966
Self-heal Daemon on ch1c7ocvgl04.covisint.net        N/A    N/A  Y  25439
Self-heal Daemon on ch1c7ocvga11.covisint.net        N/A    N/A  Y  4772
Self-heal Daemon on ch1c7ocvgl02                     N/A    N/A  Y  30524
Self-heal Daemon on ch1c7ocvgl06.covisint.net        N/A    N/A  Y  10152

Task Status of Volume test

There are no active volume tasks

@gluster-ant
Collaborator Author

Time: 20200106T11:07:47
srakonde at redhat commented:
I tried to look at the backtraces from the cores. Even though I installed release 6.1, I can't find any debug symbols.

It looks like:
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f442f3244a7 in ?? ()
[Current thread is 1 (LWP 21520)]
(gdb) bt
#0 0x00007f442f3244a7 in ?? ()
#1 0x4cce3ca800000001 in ?? ()
#2 0x0000000000018b1e in ?? ()
#3 0x00007f442f41faa8 in ?? ()
#4 0x00007f4400000000 in ?? ()
#5 0x00007f440c0174f0 in ?? ()
#6 0x00007f442f7b1b20 in ?? ()
#7 0x00007f441c4030c0 in ?? ()
#8 0x00007f442f7b1b90 in ?? ()
#9 0x00007f441c4030dc in ?? ()
#10 0x0000000000000007 in ?? ()
#11 0x0000562c75ecd4e0 in ?? ()
#12 0x00007f442f324db7 in ?? ()
#13 0x00007f4400000000 in ?? ()
#14 0x0000000000000000 in ?? ()

Can you please share the output of "t a a bt"?

Thanks,
Sanju

@gluster-ant
Collaborator Author

Time: 20200109T15:53:47
awingerter at opentext commented:
Sanju,

Thank you for the response.

I am very unfamiliar with using gdb and collecting backtraces from the cores.

Would it be possible for you to detail the configuration / collection steps needed?

Thanks and best regards,
-Anthony-

@gluster-ant
Collaborator Author

Time: 20200110T05:37:18
srakonde at redhat commented:
Hi Anthony,

  1. Load the core into gdb:
    gdb glusterd <path-to-core>
  2. The "bt" command gives you the backtrace of thread 1, while "t a a bt" (thread apply all backtrace) gives you the backtrace of all threads. Run "t a a bt" at the gdb prompt and collect the output (see the example below).
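
For reference, a minimal session along these lines might look like the following (the core file path and the log file name are placeholders; the "set logging" commands just copy everything gdb prints into the named file so the full backtrace is easy to attach here):

# gdb /usr/sbin/glusterd /path/to/core.<pid>
(gdb) set logging file glusterd-backtrace.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
(gdb) quit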

Hope that helps,
Sanju

@gluster-ant
Collaborator Author

Time: 20200120T15:49:46
awingerter at opentext commented:
Sanju,

Thank you for the response.
I apologize for getting back to you so late.

Here is some data from one of the cores where glusterd crashed.

[root@ch1c7ocvgl04 /]# gdb glusterd /core.7525
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/sbin/glusterfsd...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 7657]
[New LWP 7526]
[New LWP 7529]
[New LWP 7525]
[New LWP 7527]
[New LWP 7528]
[New LWP 7531]
[New LWP 7530]
[New LWP 7656]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fbac6a094a7 in glusterd_op_ac_brick_op_failed () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
Missing separate debuginfos, use: debuginfo-install glusterfs-server-6.1-1.el7.x86_64
(gdb) t a a bt

Thread 9 (Thread 0x7fbac3a77700 (LWP 7656)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fbac6aafddb in hooks_worker () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
#2 0x00007fbad16fedd5 in start_thread (arg=0x7fbac3a77700) at pthread_create.c:307
#3 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7fbac7e99700 (LWP 7530)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fbad28ff810 in syncenv_task () from /lib64/libglusterfs.so.0
#2 0x00007fbad29006c0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3 0x00007fbad16fedd5 in start_thread (arg=0x7fbac7e99700) at pthread_create.c:307
#4 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7fbac7698700 (LWP 7531)):
#0 0x00007fbad0fbcf73 in select () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fbad293e7e4 in runner () from /lib64/libglusterfs.so.0
#2 0x00007fbad16fedd5 in start_thread (arg=0x7fbac7698700) at pthread_create.c:307
#3 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fbac8e9b700 (LWP 7528)):
#0 0x00007fbad0f8ce2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fbad0f8ccc4 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2 0x00007fbad28eb54d in pool_sweeper () from /lib64/libglusterfs.so.0
#3 0x00007fbad16fedd5 in start_thread (arg=0x7fbac8e9b700) at pthread_create.c:307
#4 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fbac969c700 (LWP 7527)):
#0 0x00007fbad1706361 in do_sigwait (sig=0x7fbac969be1c, set=) at ../sysdeps/unix/sysv/linux/sigwait.c:60
#1 __sigwait (set=0x7fbac969be20, sig=0x7fbac969be1c) at ../sysdeps/unix/sysv/linux/sigwait.c:95
#2 0x000055b5e9cda1bb in glusterfs_sigwaiter ()
#3 0x00007fbad16fedd5 in start_thread (arg=0x7fbac969c700) at pthread_create.c:307
#4 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fbad2dbe780 (LWP 7525)):
#0 0x00007fbad16fff47 in pthread_join (threadid=140440114784000, thread_return=0x0) at pthread_join.c:90
#1 0x00007fbad2923478 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2 0x000055b5e9cd6735 in main ()

Thread 3 (Thread 0x7fbac869a700 (LWP 7529)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fbad28ff810 in syncenv_task () from /lib64/libglusterfs.so.0
#2 0x00007fbad29006c0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3 0x00007fbad16fedd5 in start_thread (arg=0x7fbac869a700) at pthread_create.c:307
#4 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

---Type <return> to continue, or q <return> to quit---
Thread 2 (Thread 0x7fbac9e9d700 (LWP 7526)):
#0 0x00007fbad1705e3d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fbad28cdf76 in gf_timer_proc () from /lib64/libglusterfs.so.0
#2 0x00007fbad16fedd5 in start_thread (arg=0x7fbac9e9d700) at pthread_create.c:307
#3 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fbac3276700 (LWP 7657)):
#0 0x00007fbac6a094a7 in glusterd_op_ac_brick_op_failed () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
#1 0x00007fbac6a09db7 in glusterd_op_sm () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
#2 0x00007fbac6a419dc in glusterd_mgmt_v3_lock_peers_cbk_fn () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
#3 0x00007fbac6a40faa in glusterd_big_locked_cbk () from /usr/lib64/glusterfs/6.1/xlator/mgmt/glusterd.so
#4 0x00007fbad2669021 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#5 0x00007fbad2669387 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#6 0x00007fbad26659f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#7 0x00007fbac5c0b875 in socket_event_handler () from /usr/lib64/glusterfs/6.1/rpc-transport/socket.so
#8 0x00007fbad2924286 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#9 0x00007fbad16fedd5 in start_thread (arg=0x7fbac3276700) at pthread_create.c:307
#10 0x00007fbad0fc5ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@gluster-ant
Collaborator Author

Time: 20200224T07:01:28
srakonde at redhat commented:
Hi Anthony,

Sorry for the delayed response on this bug. Can you please install the debuginfo packages for glusterfs and then provide the backtrace?

Thanks,
Sanju
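
For reference, gdb already pointed at the exact package in the session above ("Missing separate debuginfos, use: debuginfo-install glusterfs-server-6.1-1.el7.x86_64"). On CentOS 7 the steps would look roughly like this (debuginfo-install ships in yum-utils, and the glusterfs debuginfo packages may require the matching debuginfo repository to be enabled):

# yum install -y yum-utils
# debuginfo-install -y glusterfs-server-6.1-1.el7.x86_64
# gdb /usr/sbin/glusterd /core.7525
(gdb) thread apply all bt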

@stale

stale bot commented Oct 13, 2020

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

stale bot added the wontfix (Managed by stale[bot]) label on Oct 13, 2020
@schaffung
Member

Looking into the issue, it seems the requested information has still not been provided. Closing this issue for now; it can be reopened if required.
