[bug:1773532] Gluster brick randomly segfaults #861
Comments
Time: 20191118T12:30:17 Gluster logs from node01
Time: 20191118T12:30:54 Gluster logs from node02
Time: 20191118T12:31:20 Gluster logs from node03
Time: 20191119T09:02:55 @dominik Are you using 4kN drives? If not, then I wonder whether the 4kN code path is giving trouble; the [Invalid argument] errors could be caused by 4kN.
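For readers following along: one way to check whether a drive is 4kN is to compare its logical and physical sector sizes. A minimal sketch (the device name /dev/sdb is a placeholder, not taken from this report):

blockdev --getss /dev/sdb     # logical sector size
blockdev --getpbsz /dev/sdb   # physical block size
# A 4kN drive reports 4096/4096; a 512e drive reports 512/4096
fdisk -l /dev/sdb | grep -i 'sector size'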
Time: 20191119T09:32:25 fdisk reports as follows:
Time: 20191125T07:02:18 Can you please provide the output of "bt" and "t a a bt" from the core file? That will help us investigate this issue faster. If possible, please share the core file too. Thanks,
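For reference, a minimal sketch of collecting those backtraces (the binary and core paths are examples, not the reporter's actual paths):

# Load the core against the brick binary, with debug symbols installed
gdb /usr/sbin/glusterfsd /path/to/core
(gdb) bt                   # backtrace of the faulting thread
(gdb) t a a bt             # shorthand for "thread apply all backtrace"
(gdb) set pagination off   # optional: suppress the ---Type <return>--- prompts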
Time: 20191126T08:06:01
In my case I have something like the below, so my core files will be stored under /root/cores/:
[root@localhost glusterfs]# cat /etc/sysctl.conf
# Own core file pattern
kernel.core_pattern=/root/cores/core.%e.%p.%h.%t
HTH,
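As a hedged sketch of applying such a pattern (directory and pattern follow the example above; note that on CentOS/RHEL the abrt service may override kernel.core_pattern with its own pipe handler):

mkdir -p /root/cores
# Apply immediately, then persist across reboots
sysctl -w kernel.core_pattern=/root/cores/core.%e.%p.%h.%t
echo 'kernel.core_pattern=/root/cores/core.%e.%p.%h.%t' >> /etc/sysctl.conf
sysctl -p
# Cores are only written if the process's core size limit allows it
ulimit -c unlimited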
Time: 20191126T08:25:22
(gdb) bt
(gdb) t a a bt
Thread 40 (Thread 0x7f7d184bf4c0 (LWP 138985)):
Thread 39 (Thread 0x7f7d000ac700 (LWP 140549)):
Thread 38 (Thread 0x7f7cf46b3700 (LWP 140758)):
Thread 37 (Thread 0x7f7cf4431700 (LWP 140760)):
Thread 36 (Thread 0x7f7d0d59b700 (LWP 138991)):
Thread 35 (Thread 0x7f7cf4472700 (LWP 140759)):
Thread 34 (Thread 0x7f7d0ed9e700 (LWP 138988)):
Thread 33 (Thread 0x7f7d0dd9c700 (LWP 138990)):
Thread 32 (Thread 0x7f7d0cd9a700 (LWP 138992)):
Thread 31 (Thread 0x7f7d000ed700 (LWP 140548)):
Thread 30 (Thread 0x7f7d0016f700 (LWP 139188)):
Thread 29 (Thread 0x7f7d008f8700 (LWP 139011)):
Thread 28 (Thread 0x7f7d0e59d700 (LWP 138989)):
Thread 27 (Thread 0x7f7d0012e700 (LWP 139189)):
Thread 26 (Thread 0x7f7cedffb700 (LWP 139134)):
Thread 25 (Thread 0x7f7d0f59f700 (LWP 138987)):
Thread 24 (Thread 0x7f7cef7fe700 (LWP 139114)):
Thread 23 (Thread 0x7f7d001f1700 (LWP 139118)):
Thread 22 (Thread 0x7f7cf46f4700 (LWP 140757)):
Thread 21 (Thread 0x7f7cf47b7700 (LWP 140552)):
Thread 20 (Thread 0x7f7cf4735700 (LWP 140756)):
Thread 19 (Thread 0x7f7cf4776700 (LWP 140755)):
Thread 18 (Thread 0x7f7cf47f8700 (LWP 140551)):
Thread 17 (Thread 0x7f7d0006b700 (LWP 140550)):
Thread 16 (Thread 0x7f7d001b0700 (LWP 139149)):
Thread 15 (Thread 0x7f7cee7fc700 (LWP 139126)):
Thread 14 (Thread 0x7f7ceeffd700 (LWP 139117)):
Thread 13 (Thread 0x7f7cf7fff700 (LWP 139013)):
Thread 12 (Thread 0x7f7ceffff700 (LWP 139020)):
Thread 11 (Thread 0x7f7cf4ff9700 (LWP 139019)):
Thread 10 (Thread 0x7f7cf5ffb700 (LWP 139017)):
Thread 9 (Thread 0x7f7d080f7700 (LWP 139012)):
Thread 8 (Thread 0x7f7cf67fc700 (LWP 139016)):
Thread 7 (Thread 0x7f7cf57fa700 (LWP 139018)):
Thread 6 (Thread 0x7f7cf6ffd700 (LWP 139015)):
---Type <return> to continue, or q <return> to quit---
Thread 4 (Thread 0x7f7d0a0c8700 (LWP 138998)):
Thread 3 (Thread 0x7f7d01e01700 (LWP 139010)):
Thread 2 (Thread 0x7f7d02602700 (LWP 139009)):
Thread 1 (Thread 0x7f7d0a8c9700 (LWP 138997)):
I can share the core file as a link (it's more than 700 MB). Is that fine with Bugzilla's policy?
Time: 20191202T06:34:35 I suspect you may not have provided the right coredump, because all the backtraces look normal; I'm unable to make anything out of this. Thanks,
Time: 20191205T10:07:16
Looking at the backtrace you have provided, I can say that you already have the debug-info packages installed. I suspect you have provided the backtrace from the wrong core file, as it doesn't show any thread actually crashing. Can you please cross-check? Thanks,
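A hedged way to double-check that the debug-info packages really are present (RPM-based systems; package name may vary by distribution):

rpm -qa 'glusterfs*debuginfo*'   # should list glusterfs-debuginfo if installed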
Time: 20191206T07:43:49
journalctl -u glusterd:
gru 06 02:54:38 node02 opt-data-vmstore[22826]: pending frames:
ls -l /var/tmp/abrt/
ls -l coredump
So this should be the correct coredump. The brick on node02 was down since 02:54.
[Thread debugging using libthread_db enabled]
(gdb) t a a bt
Thread 41 (Thread 0x7f82e934e700 (LWP 22832)):
Thread 40 (Thread 0x7f82d026d700 (LWP 23162)):
Thread 39 (Thread 0x7f82d02ef700 (LWP 23160)):
Thread 38 (Thread 0x7f82c6ffd700 (LWP 22876)):
Thread 37 (Thread 0x7f82dc0a1700 (LWP 22873)):
Thread 36 (Thread 0x7f82c67fc700 (LWP 22880)):
Thread 35 (Thread 0x7f82d0571700 (LWP 23084)):
Thread 34 (Thread 0x7f82d06f7700 (LWP 22883)):
Thread 33 (Thread 0x7f82d06b6700 (LWP 22884)):
Thread 32 (Thread 0x7f82c77fe700 (LWP 22851)):
Thread 31 (Thread 0x7f82d05b2700 (LWP 23083)):
Thread 30 (Thread 0x7f82dc060700 (LWP 22882)):
Thread 29 (Thread 0x7f82d02ae700 (LWP 23161)):
Thread 28 (Thread 0x7f82d17fa700 (LWP 22845)):
Thread 27 (Thread 0x7f82d05f3700 (LWP 23082)):
Thread 26 (Thread 0x7f82d022c700 (LWP 23163)):
Thread 25 (Thread 0x7f82d0634700 (LWP 23081)):
---Type <return> to continue, or q <return> to quit---
Thread 23 (Thread 0x7f82d0330700 (LWP 23159)):
Thread 22 (Thread 0x7f82d0675700 (LWP 22885)):
Thread 21 (Thread 0x7f82d01eb700 (LWP 23164)):
Thread 20 (Thread 0x7f82ea350700 (LWP 22830)):
Thread 19 (Thread 0x7f82eab51700 (LWP 22829)):
Thread 18 (Thread 0x7f82eb352700 (LWP 22828)):
Thread 17 (Thread 0x7f82dcf39700 (LWP 22837)):
Thread 16 (Thread 0x7f82f4a734c0 (LWP 22826)):
Thread 15 (Thread 0x7f82d0ff9700 (LWP 22846)):
Thread 14 (Thread 0x7f82e9b4f700 (LWP 22831)):
Thread 13 (Thread 0x7f82dca34700 (LWP 22839)):
Thread 12 (Thread 0x7f82ebb53700 (LWP 22827)):
Thread 11 (Thread 0x7f82f48a74c0 (LWP 730)):
Thread 10 (Thread 0x7f82d27fc700 (LWP 22843)):
Thread 9 (Thread 0x7f82d1ffb700 (LWP 22844)):
Thread 8 (Thread 0x7f82d2ffd700 (LWP 22842)):
Thread 7 (Thread 0x7f82d37fe700 (LWP 22841)):
Thread 6 (Thread 0x7f82d3fff700 (LWP 22840)):
Thread 5 (Thread 0x7f82e667c700 (LWP 22834)):
Thread 4 (Thread 0x7f82e406a700 (LWP 22838)):
Thread 3 (Thread 0x7f82de442700 (LWP 22836)):
Thread 2 (Thread 0x7f82dec43700 (LWP 22835)):
Thread 1 (Thread 0x7f82e6e7d700 (LWP 22833)):
I guess that's all I can extract from the coredump. I can attach a link to the coredump file if that's helpful.
Kind regards,
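One hedged way to confirm a core file belongs to the crashed brick (PID 22826 in the journal above) before loading it in gdb; the abrt directory layout here is an assumption, not taken from this report:

# "file" reports the program that dumped the core (recent versions also show the PID)
file /var/tmp/abrt/ccpp-*/coredump
# Compare the core's mtime against the crash time in the journal
ls -l --time-style=full-iso /var/tmp/abrt/ccpp-*/coredump
# Install debug symbols so backtraces resolve (RHEL/CentOS 7; name may vary)
debuginfo-install -y glusterfs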
Time: 20200114T10:28:18 Does that help a bit with troubleshooting? Kind regards,
Time: 20200219T14:07:04 Please share if you have any updates. Thanks,
Dominik, I think you don't need to upgrade, at least for now. If I understood correctly, you said that after installing another cluster with oVirt 4.3.7 the problem disappeared. I would like you to check which version of Gluster that new cluster is using. If it's 6.7, that would explain the problem, because the issue seems the same as this one, and it was fixed in 6.7.
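A quick, hedged way to check which Gluster version the new cluster is actually running on each node:

gluster --version | head -1   # version of the installed Gluster CLI/client
rpm -q glusterfs-server       # installed server package (RPM-based distros)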
Thank you for your contributions. |
Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.
URL: https://bugzilla.redhat.com/1773532
Creator: ddrazyk at gmail
Time: 20191118T11:33:57
Created attachment 1637263
Compressed logs from journald, glusterd and vdsm.
Description of problem:
I am running a 3-node oVirt cluster with a GlusterFS storage domain. Gluster is configured with LVM cache in writeback mode on hardware RAID (one virtual device is an SSD and the second is an HDD, on an LSI controller), backed by XFS. The cluster serves two volumes: wiosna-vmstore, which is the Data storage domain, and wiosna-iso, which is an ISO domain. Both have sharding turned on. Management is on a separate physical machine.
I randomly get glusterd segfaults which cause a brick to go down (it's either iso or vmstore, never both). When two nodes get a segfault, all VMs end up in the Paused state. All hosts run a
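For anyone trying to reproduce this setup, a hedged sketch of commands that confirm the configuration described above (volume names are taken from this report; the lvs field name may differ across lvm2 versions):

gluster volume info wiosna-vmstore | grep -i shard   # sharding enabled?
gluster volume status                                # a crashed brick shows "N" in the Online column
lvs -o lv_name,cache_mode                            # LVM cache in writeback mode?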
Version-Release number of selected component (if applicable):
vdsm-gluster-4.30.33-1.el7.x86_64
glusterfs-6.6-1.el7.x86_64
How reproducible:
Don't know. Occurs randomly.
Steps to Reproduce:
N/A
Actual results:
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: patchset: git://git.gluster.org/glusterfs.git
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: signal received: 11
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: time of crash:
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: 2019-11-18 00:53:27
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: configuration details:
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: argp 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: backtrace 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: dlfcn 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: libpthread 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: llistxattr 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: setfsid 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: spinlock 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: epoll.h 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: xattr.h 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: st_atim.tv_nsec 1
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: package-string: glusterfs 6.5
lis 18 01:53:27 node01.wiosna.org.pl opt-data-vmstore[15340]: ---------
Expected results:
Normal operation.
Additional info: