fuse causes glusterd to dump core #1225
Comments
This is a known issue, and it is fixed by the patch (https://review.gluster.org/#/c/glusterfs/+/24231/).
I think there must be some corner case not fixed by the patch, because it shouldn't fail in 7.5 (the patch is already present in 7.5).
While the cause is being analyzed, you can disable open-behind to avoid the crash:
# gluster volume set <volname> open-behind off
Can you post the output of
Here are the log files: And zipped core:
I'm not sure why, but open-behind is still enabled (it should appear as disabled in the Looking at the pgdata log, I can also see that open-behind is present in the last configuration just before crashing, and the crash is related to open-behind. Can you disable it again and check that it's actually disabled with a
I have a similar setup on CentOS 7 and I am not able to reproduce this situation there. It only happens with Ubuntu, and only on the second node.
A patch https://review.gluster.org/24451 has been posted that references this issue: "open-behind: rewrite of internal logic" (Change-Id: I6376a5491368e0e1c283cc452849032636261592).
A patch https://review.gluster.org/24542 has been posted that references this issue: "open-behind: rewrite of internal logic" (Change-Id: I6376a5491368e0e1c283cc452849032636261592).
A patch https://review.gluster.org/24544 has been posted that references this issue: "open-behind: rewrite of internal logic" (Change-Id: I6376a5491368e0e1c283cc452849032636261592).
The patch posted should fix the issue, but it's a big change, so I recommend testing it before going to production with open-behind enabled.
There was a critical flaw in the previous implementation of open-behind.

When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However, as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak.

To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use.

To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor.

The whole logic of handling background opens has been simplified and it's more efficient now. A stub is created only if the fop needs to be delayed until an open completes. Otherwise no memory allocations are needed.

Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler.

Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
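The trade-off described in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual glusterfs code: `toy_fd`, `ob_fake_open`, `ob_on_app_close` and `simulate_early_close` are hypothetical names invented for this example, and the real fd_t and xlator callback interfaces are considerably richer. It simulates an application that closes a descriptor before the background open has completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for fd_t: a reference count plus a few flags. */
typedef struct {
    int refs;           /* outstanding references */
    bool released;      /* set once the release path has run */
    bool open_pending;  /* the background open has not completed yet */
    bool ob_holds_ref;  /* open-behind took its own reference */
} toy_fd;

void toy_fd_ref(toy_fd *fd)
{
    fd->refs++;
}

void toy_fd_unref(toy_fd *fd)
{
    assert(fd->refs > 0);
    if (--fd->refs == 0)
        fd->released = true; /* release runs only when the last ref is dropped */
}

/* Open-behind answers the open immediately and defers the real one. */
void ob_fake_open(toy_fd *fd, bool take_ref)
{
    fd->open_pending = true;
    fd->ob_holds_ref = take_ref;
    if (take_ref)
        toy_fd_ref(fd);
}

/* Hypothetical close notification, loosely modelled on the "new xlator cbk"
 * that fuse calls when the application closes a file descriptor. */
void ob_on_app_close(toy_fd *fd)
{
    if (fd->ob_holds_ref) {
        fd->open_pending = false; /* the deferred open is no longer needed */
        fd->ob_holds_ref = false;
        toy_fd_unref(fd);
    }
}

/* The application opens a file and closes it before the real open happens.
 * Returns the final state of the toy fd. */
toy_fd simulate_early_close(bool take_ref, bool notify_close)
{
    toy_fd fd = { .refs = 1 };  /* reference held on behalf of the application */
    ob_fake_open(&fd, take_ref);
    if (notify_close)
        ob_on_app_close(&fd);   /* the close is reported before the ref is dropped */
    toy_fd_unref(&fd);          /* the application's reference goes away */
    return fd;
}
```

In this model, `simulate_early_close(true, false)` leaves one reference behind and never runs release (the leak), `simulate_early_close(false, false)` runs release while the open is still pending (the race), and `simulate_early_close(true, true)` ends with zero references and no pending open.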
[root@test1 /]# ll /var/lib/portsip/pgsql/data/pg_stat_tmp/
I have also run into this problem: PgSQL fails to start when a gluster node is unavailable.
@FrelDX how is your problem related to this issue? Does the problem disappear if you disable
@FrelDX Disabling open-behind has helped me a lot.
Whilst disabling open-behind helps our case a LOT, it does not eliminate the problem.
@jkroonza Please open a new issue with the stack trace when it happens again. This issue tracks the open-behind problem, which is now fixed in the latest release-7 and release-8 and in master. Closing it.
There was a critical flaw in the previous implementation of open-behind.

When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However, as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak.

To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use.

To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor.

The whole logic of handling background opens has been simplified and it's more efficient now. A stub is created only if the fop needs to be delayed until an open completes. Otherwise no memory allocations are needed.

Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler.

Upstream patch:
> Upstream-patch-link: https://review.gluster.org/#/c/glusterfs/+/24451
> Change-Id: I6376a5491368e0e1c283cc452849032636261592
> Fixes: gluster#1225
> Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>

BUG: 1830713
Change-Id: I6376a5491368e0e1c283cc452849032636261592
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
Reviewed-on: https://code.engineering.redhat.com/gerrit/224487
Tested-by: RHGS Build Bot <nigelb@redhat.com>
Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu@redhat.com>
Description of problem:
Created 2 mirrored glusterfs volumes. On the second node, fuse causes glusterd to crash when "pg_ctl initdb" is run against the glusterfs mount. The first node (xt-ha1) does not seem to be affected; glusterd crashes only when the "initdb" command is issued on the second node (xt-ha2).
Both machines are deployed from the same VMware template, both are updated, and both have the same software versions and patch levels.
The exact command to reproduce the issue:
postgres@xt-ha2:~$ /usr/lib/postgresql/10/bin/pg_ctl initdb -D /pgdata/pgdata
The full output of the command that failed:
postgres@xt-ha2:~$ /usr/lib/postgresql/10/bin/pg_ctl initdb -D /pgdata/pgdata
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /pgdata/pgdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Europe/Tallinn
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... 2020-05-07 10:51:37.616 EEST [8986] LOG: could not open file "pg_wal/000000010000000000000001": Software caused connection abort
2020-05-07 10:51:37.616 EEST [8986] FATAL: could not open file "pg_wal/000000010000000000000001": Transport endpoint is not connected
child process exited with exit code 1
initdb: removing contents of data directory "/pgdata/pgdata"
could not open directory "/pgdata/pgdata": Transport endpoint is not connected
initdb: failed to remove contents of data directory
pg_ctl: database system initialization failed
Expected result:
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /pgdata/pgdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Europe/Tallinn
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
/usr/lib/postgresql/10/bin/pg_ctl -D /pgdata/pgdata -l logfile start
postgres@xt-ha1:~$
Stack trace:
glusterd_trace.txt
Additional info:
/etc/hosts:
192.168.57.186 xt-ha1.example.com
192.168.57.187 xt-ha2.example.com
/dev/mapper/vgglupgdata-data01 on /glupgdata type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/vgglupgbackup-backup on /glupgbackup type xfs (rw,relatime,attr2,inode64,noquota)
192.168.57.187:/glu-pgdata on /pgdata type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
192.168.57.187:/glu-pgbackup on /pgbackup type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)
- The output of the gluster volume info command:
Volume Name: glu-pgbackup
Type: Replicate
Volume ID: 30d323bd-3eab-4e36-9e14-c1508b03b804
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: xt-ha1.example.com:/glupgbackup/pgbackup
Brick2: xt-ha2.example.com:/glupgbackup/pgbackup
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
features.barrier: disable
Volume Name: glu-pgdata
Type: Replicate
Volume ID: 232c30d0-8c5e-4a71-9fa6-45f39d64fc6c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: xt-ha1.example.com:/glupgdata/pgdata
Brick2: xt-ha2.example.com:/glupgdata/pgdata
Options Reconfigured:
features.barrier: disable
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.self-heal-daemon: enable
- The operating system / glusterfs version:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic