
Glusterfs v10.4 'No space left on device' yet we have plenty of space on all nodes #4135

Open
brandonshoemakerTH opened this issue May 7, 2023 · 38 comments

Comments

@brandonshoemakerTH

Description of problem:
We are seeing an 'error=No space left on device' issue on GlusterFS 10.4 on AlmaLinux 8 (4.18.0-425.19.2.el8_7.x86_64) even though we currently have 61 TB available on the volume, and each of the 12 nodes has 2-8 TB free, so we are nowhere near out of space on any node.

#example log msg from /var/log/glusterfs/home-volbackups.log
[2023-05-06 23:47:38.645324 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:670:client4_0_writev_cbk] 0-volbackups-client-23: remote operation failed. [{errno=28}, {error=No space left on device}]
[2023-05-06 23:47:38.645376 +0000] W [fuse-bridge.c:1970:fuse_err_cbk] 0-glusterfs-fuse: 980901423: FLUSH() ERR => -1 (No space left on device)

The exact command to reproduce the issue:
We have used vsftpd and GlusterFS for around 8 years for FTP uploads of backup files, and for around 3 years for NFS uploads of backup files. Each GlusterFS node has a single brick and locally mounts the single distributed volume as a GlusterFS client, receiving ftp > vsftpd > glusterfs backup files onto the volume each weekend. After about 24 hours of FTP uploads the 'no space' errors start appearing in the logs and writes begin to fail. However, we have plenty of space on all nodes and we are using the 'cluster.min-free-disk: 1GB' volume setting. If we reboot all the GlusterFS nodes the problem goes away for a while, but then it returns after ~12-24 hours.
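For reference, each node mounts the volume on itself roughly like the fstab-style line below; the /home/volbackups mount point is inferred from the home-volbackups.log log name, so treat the exact path and method as an assumption:

# local FUSE mount of the distributed volume on each node (sketch only)
localhost:/volbackups  /home/volbackups  glusterfs  defaults,_netdev  0 0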

The full output of the command that failed:
Here is an example ftp backup file upload that fails this weekend:
put: 125ac755-05b1-4d48-9a7d-96e7cd423700-vda.bak: Access failed: 553 Could not create file. (125ac755-05b1-4d48-9a7d-96e7cd423700-vda.qcow2)

Here are some example nfs backup file writes that fail from last weekend:
/bin/cp: failed to close '/backups/instance-00016239.xml': No space left on device
/bin/cp: failed to close '/backups/instance-00016221.xml': No space left on device
/bin/cp: failed to close '/backups/instance-00016248.xml': No space left on device
/bin/cp: failed to close '/backups/instance-0001625a.xml': No space left on device
qemu-img: error while writing sector 19931136: No space left on device
qemu-img: Failed to flush the L2 table cache: No space left on device
qemu-img: Failed to flush the refcount block cache: No space left on device
qemu-img: /backups/2699ee2f-92b8-4804-a7c7-1dc4e2abed29-vda.qcow2: error while converting qcow2: Could not close the new file: No space left on device
/bin/cp: failed to close '/backups/73fa3986-f450-4b36-b7d4-dcbdcd494562-instance-0001609e-disk.config': No space left on device
/bin/cp: failed to close '/backups/instance-00016104.xml': No space left on device
/bin/cp: failed to close '/backups/5c82fbdb-2be7-45fe-871d-604453868edc-instance-000160f2-disk.config': No space left on device
/bin/cp: failed to close '/backups/24acc824-94d5-4026-9abe-072a1b257cc0-instance-00016119-disk.info': No space left on device
/bin/cp: failed to close '/backups/instance-0001611f.xml': No space left on device
/bin/cp: failed to close '/backups/instance-0001613d.xml': No space left on device

Expected results:
FTP and NFS upload writes are expected to succeed as they have in the past.

Mandatory info:
- The output of the gluster volume info command:

[root@nybaknode1 ~]# gluster volume info volbackups

Volume Name: volbackups
Type: Distribute
Volume ID: cd40794d-ab74-4706-a0bc-3e95bb8c63a2
Status: Started
Snapshot Count: 0
Number of Bricks: 12
Transport-type: tcp
Bricks:
Brick1: nybaknode9.domain.net:/lvbackups/brick
Brick2: nybaknode11.domain.net:/lvbackups/brick
Brick3: nybaknode2.domain.net:/lvbackups/brick
Brick4: nybaknode3.domain.net:/lvbackups/brick
Brick5: nybaknode4.domain.net:/lvbackups/brick
Brick6: nybaknode12.domain.net:/lvbackups/brick
Brick7: nybaknode5.domain.net:/lvbackups/brick
Brick8: nybaknode6.domain.net:/lvbackups/brick
Brick9: nybaknode7.domain.net:/lvbackups/brick
Brick10: nybaknode8.domain.net:/lvbackups/brick
Brick11: nybaknode10.domain.net:/lvbackups/brick
Brick12: nybaknode1.domain.net:/lvbackups/brick
Options Reconfigured:
performance.cache-size: 256MB
server.event-threads: 16
performance.io-thread-count: 32
performance.client-io-threads: on
client.event-threads: 16
diagnostics.brick-sys-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.cache-max-file-size: 2MB
transport.address-family: inet
nfs.disable: on
cluster.min-free-disk: 1GB
[root@nybaknode1 ~]#

- The output of the gluster volume status command:

[root@nybaknode1 ~]# gluster volume status volbackups
Status of volume: volbackups
Gluster process                                TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------
Brick nybaknode9.domain.net:/lvbackups/brick   59026     0          Y       1986
Brick nybaknode11.domain.net:/lvbackups/brick  60172     0          Y       2033
Brick nybaknode2.domain.net:/lvbackups/brick   58067     0          Y       1579
Brick nybaknode3.domain.net:/lvbackups/brick   58210     0          Y       1603
Brick nybaknode4.domain.net:/lvbackups/brick   52719     0          Y       1681
Brick nybaknode12.domain.net:/lvbackups/brick  52193     0          Y       1895
Brick nybaknode5.domain.net:/lvbackups/brick   53655     0          Y       1667
Brick nybaknode6.domain.net:/lvbackups/brick   56614     0          Y       1591
Brick nybaknode7.domain.net:/lvbackups/brick   49492     0          Y       1719
Brick nybaknode8.domain.net:/lvbackups/brick   51497     0          Y       1701
Brick nybaknode10.domain.net:/lvbackups/brick  49787     0          Y       1878
Brick nybaknode1.domain.net:/lvbackups/brick   52392     0          Y       1781

Task Status of Volume volbackups

Task : Rebalance
ID : 1ea52278-ea1b-4d7e-857a-fe2ee1dc5420
Status : completed

[root@nybaknode1 ~]#

- The output of the gluster volume heal command:

Not relevant. We are using a plain distributed volume with no replicas.

- The output of the gluster volume status detail command:

[root@nybaknode1 ~]# gluster volume status volbackups detail
Status of volume: volbackups

Brick : Brick nybaknode9.domain.net:/lvbackups/brick
TCP Port : 59026
RDMA Port : 0
Online : Y
Pid : 1986
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 4.6TB
Total Disk Space : 29.0TB
Inode Count : 3108974976
Free Inodes : 3108903409

Brick : Brick nybaknode11.domain.net:/lvbackups/brick
TCP Port : 60172
RDMA Port : 0
Online : Y
Pid : 2033
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 8.2TB
Total Disk Space : 43.5TB
Inode Count : 4672138432
Free Inodes : 4672063970

Brick : Brick nybaknode2.domain.net:/lvbackups/brick
TCP Port : 58067
RDMA Port : 0
Online : Y
Pid : 1579
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 5.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849261

Brick : Brick nybaknode3.domain.net:/lvbackups/brick
TCP Port : 58210
RDMA Port : 0
Online : Y
Pid : 1603
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 4.6TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849248

Brick : Brick nybaknode4.domain.net:/lvbackups/brick
TCP Port : 52719
RDMA Port : 0
Online : Y
Pid : 1681
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 5.0TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108848785

Brick : Brick nybaknode12.domain.net:/lvbackups/brick
TCP Port : 52193
RDMA Port : 0
Online : Y
Pid : 1895
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 7.5TB
Total Disk Space : 43.5TB
Inode Count : 4671718976
Free Inodes : 4671644748

Brick : Brick nybaknode5.domain.net:/lvbackups/brick
TCP Port : 53655
RDMA Port : 0
Online : Y
Pid : 1667
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 3.3TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849458

Brick : Brick nybaknode6.domain.net:/lvbackups/brick
TCP Port : 56614
RDMA Port : 0
Online : Y
Pid : 1591
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota
Inode Size : 512
Disk Space Free : 5.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849533

Brick : Brick nybaknode7.domain.net:/lvbackups/brick
TCP Port : 49492
RDMA Port : 0
Online : Y
Pid : 1719
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 2.4TB
Total Disk Space : 14.4TB
Inode Count : 1546333376
Free Inodes : 1546264508

Brick : Brick nybaknode8.domain.net:/lvbackups/brick
TCP Port : 51497
RDMA Port : 0
Online : Y
Pid : 1701
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota
Inode Size : 512
Disk Space Free : 4.4TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108849200

Brick : Brick nybaknode10.domain.net:/lvbackups/brick
TCP Port : 49787
RDMA Port : 0
Online : Y
Pid : 1878
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota
Inode Size : 512
Disk Space Free : 6.7TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108850142

Brick : Brick nybaknode1.domain.net:/lvbackups/brick
TCP Port : 52392
RDMA Port : 0
Online : Y
Pid : 1781
File System : xfs
Device : /dev/mapper/vgbackups-lvbackups
Mount Options : rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=128,swidth=128,noquota
Inode Size : 512
Disk Space Free : 6.6TB
Total Disk Space : 29.0TB
Inode Count : 3108921344
Free Inodes : 3108850426

[root@nybaknode1 ~]#

- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/

Sanitized logs from one of the affected gluster nodes, which experienced the issue today and last week, are attached. If you need more logs please let us know and we are willing to share more logs directly with someone. We have 12 GlusterFS nodes in this location for our backups.

- Is there any crash? Provide the backtrace and coredump:

No crash is involved as far as I know.

Additional info:

We are seeing the 'error=No space left on device' issue on GlusterFS 10.4 on AlmaLinux 8 (4.18.0-425.19.2.el8_7.x86_64) and hope someone can help advise, as it has become critical: we use GlusterFS for backups of the entire infrastructure for the affected location (NYC). We have another, similarly configured location on 10.3 that is not yet experiencing this issue, but it is about 60% smaller by number of nodes.

We have been using this 12-node GlusterFS (plain) distributed vsftpd backup cluster for years (it is not new), and about 3-4 weeks ago we upgraded from v9 to v10.4. I do not know whether the upgrade is related to this new issue.

We are seeing a new 'error=No space left on device' error (below) on multiple Gluster v10.4 nodes in the logs. It appeared in the logs for about half the nodes (5 out of 12) last week, and 2 more today before I rebooted. The issue goes away if we reboot all the GlusterFS nodes, but backups take a little over 2 days to complete each weekend, and the issue returns after about 1 day of backups running, before the backup cycle is complete. It has happened on each of the last 3 weekends we have run backups to these nodes.

#example log msg from /var/log/glusterfs/home-volbackups.log
[2023-05-06 23:47:38.645324 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:670:client4_0_writev_cbk] 0-volbackups-client-23: remote operation failed. [{errno=28}, {error=No space left on device}]
[2023-05-06 23:47:38.645376 +0000] W [fuse-bridge.c:1970:fuse_err_cbk] 0-glusterfs-fuse: 980901423: FLUSH() ERR => -1 (No space left on device)

Each GlusterFS node has a single brick and locally mounts the single distributed volume as a GlusterFS client, receiving our backup files onto the volume over FTP and over NFS-Ganesha each weekend. This weekend we tested only FTP uploads and the problem happened the same with or without NFS-Ganesha backup file uploads.

We distribute the FTP upload load between the servers through a combination of /etc/hosts entries and AWS weighted DNS. We also use NFS-Ganesha, but this weekend we ran only FTP backup uploads as a test to rule out NFS-Ganesha, and we experienced the same issue with FTP uploads only.
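The distribution itself is nothing exotic; a rough sketch of the kind of pinning we mean (the hostname and address below are placeholders, not our real entries):

# /etc/hosts on an upload source, pinning its FTP target to one backup node;
# other sources resolve the same name through AWS weighted DNS records instead
192.0.2.11   backups-ftp.domain.net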

We currently have 61 TB available on the volume, and each of the 12 nodes has 2-8 TB free, so we are nowhere near out of space on any node.

We have already tried changing the setting from 'cluster.min-free-disk: 1%' to 'cluster.min-free-disk: 1GB' and rebooted all the gluster nodes to refresh them, and it happened again. That was mentioned as an idea in this doc: https://access.redhat.com/solutions/276483.
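For reference, the change was applied with the usual volume-set command (volume name as in our setup):

gluster volume get volbackups cluster.min-free-disk      # show the value currently in effect
gluster volume set volbackups cluster.min-free-disk 1GB  # switch from a percentage to an absolute size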

Does anyone know what we might check next?

Crossposted to https://lists.gluster.org/pipermail/gluster-users/2023-May/040289.html

- The operating system / glusterfs version:

Almalinux 8 4.18.0-425.19.2.el8_7.x86_64
[root@nybaknode1 ~]# rpm -qa | grep -E 'gluster|nfs'
nfs-ganesha-selinux-3.5-3.el8.noarch
glusterfs-client-xlators-10.4-1.el8s.x86_64
nfs-ganesha-utils-3.5-3.el8.x86_64
glusterfs-selinux-2.0.1-1.el8s.noarch
libglusterd0-10.4-1.el8s.x86_64
nfs-ganesha-gluster-3.5-3.el8.x86_64
libnfsidmap-2.3.3-57.el8_7.1.x86_64
libglusterfs0-10.4-1.el8s.x86_64
glusterfs-cli-10.4-1.el8s.x86_64
glusterfs-server-10.4-1.el8s.x86_64
nfs-ganesha-3.5-3.el8.x86_64
centos-release-nfs-ganesha30-1.0-2.el8.noarch
glusterfs-fuse-10.4-1.el8s.x86_64
sssd-nfs-idmap-2.7.3-4.el8_7.3.x86_64
centos-release-gluster10-1.0-1.el8.noarch
glusterfs-10.4-1.el8s.x86_64
nfs-utils-2.3.3-57.el8_7.1.x86_64
[root@nybaknode1 ~]#

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

logs-screenshots-sanitized.zip

@mohit84
Contributor

mohit84 commented May 7, 2023

In release 10.4 we recently changed the code path to respect the storage.reserve value via patch #3636; that change is why you are facing this issue.
For the time being, I would suggest downgrading GlusterFS to release 10.3 to avoid the issue.
I will try to fix this.
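If it helps to confirm on an affected node, a quick sketch to compare the reserve the brick enforces against its actual headroom (brick path taken from this report):

gluster volume get volbackups storage.reserve   # reserve value the bricks enforce (a percentage by default)
df -h /lvbackups/brick                          # free space as the brick filesystem itself sees it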

@mohit84
Contributor

mohit84 commented May 7, 2023

Can you please take a statedump of any one brick process that is currently throwing the "No space left on device" error? To take a statedump, send a SIGUSR1 signal to the brick process ("kill -SIGUSR1 <brick_pid>"); the command will generate a statedump in the /var/run/gluster directory.
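For example, something along these lines on an affected node (brick path from this report; <brick_pid> as in the command above):

pgrep -af 'glusterfsd.*lvbackups'   # find the PID of the brick process
kill -SIGUSR1 <brick_pid>           # ask that brick to dump its state
ls -lt /var/run/gluster/            # the newest *.dump.* file is the statedump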

@brandonshoemakerTH
Author

Hi @mohit84, thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the error, so it will take another 12-24 hours before we see the issue reoccur, but it will, and I will come back with the requested statedump.

Can you point me to any docs, or advise on the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail, and this is a 12-node, 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems this would be a more manual process.

@brandonshoemakerTH
Author

Re-opening. Sorry, it seems the issue was closed on my last reply.

@mohit84
Contributor

mohit84 commented May 7, 2023

Hi @mohit84, thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the error, so it will take another 12-24 hours before we see the issue reoccur, but it will, and I will come back with the requested statedump.

Can you point me to any docs, or advise on the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail, and this is a 12-node, 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems this would be a more manual process.

The downgrade procedure is similar to the upgrade; you need to follow the same process. Yes, it is completely safe.

@mohit84
Contributor

mohit84 commented May 7, 2023

Hi @mohit84, thanks so much for the prompt reply and advice. I had to reboot all the nodes just before I posted this issue here to clear the error, so it will take another 12-24 hours before we see the issue reoccur, but it will, and I will come back with the requested statedump.
Can you point me to any docs, or advise on the basic approach to follow, for a downgrade to 10.3 on RHEL8/AlmaLinux8? Is it a reliable procedure? Unfortunately, I'm not familiar with what all this would entail, and this is a 12-node, 362 TB backup volume. 'yum downgrade [glusterfs-server-pkg]' does not offer anything, so it seems this would be a more manual process.

The downgrade procedure is similar to the upgrade; you need to follow the same process. Yes, it is completely safe.

You can try once in test environment if you are hesitant to try in the production environment.
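For what it's worth, on AlmaLinux 8 with the Storage SIG repo a downgrade would presumably look something like the sketch below; whether 10.3 builds are still published in the repo, and the exact per-node stop/swap/start order, should be checked against the upgrade guide first:

# hedged sketch only - done one node at a time, like an upgrade in reverse
dnf --showduplicates list glusterfs-server   # confirm a 10.3-1.el8s build is still available
systemctl stop glusterd && pkill glusterfsd  # stop the management daemon and brick processes
dnf downgrade 'glusterfs*' 'libgluster*'     # roll the gluster packages back one release
systemctl start glusterd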

@brandonshoemakerTH
Author

OK, I will set up a test server to test it. I will look for the 10.3 packages tomorrow as it is midnight here. Thanks for the advice.

@brandonshoemakerTH
Author

Hi @mohit84

I have the statedump file now, as the issue happened again in the last hour. I think I have sanitized the file and removed our domain references. Is there anything else in this file that might be sensitive besides the domain/hostname references? It is 230215 lines, so I am not able to check it all and be sure.

Is it possible to send this to you privately somehow, or only through a public reply here? Or are the hostname and directory path references the only sensitive things in the file?

@mohit84
Contributor

mohit84 commented May 8, 2023

Yes, you can share it with me at moagrawa@redhat.com.

@brandonshoemakerTH
Author

Thanks @mohit84 we sent the statedump and other log files.
We have downgraded to 10.3 this morning and will re-run our backups to these glusterfs 10.3 servers.
Do let us know if we can assist your team with anything else regarding this issue.
We will report back in a few days after backups hopefully complete without re-encountering the issue.

@eg-ops

eg-ops commented May 11, 2023

Since updating to version 10.4, we have been facing the same issue. After a couple of hours, we receive the error message 'No space left on device', and we have to restart all three GlusterFS nodes. After that, it works for the next couple of hours until we encounter the same issue again.

@brandonshoemakerTH
Author

brandonshoemakerTH commented May 11, 2023

@mohit84 After downgrading to 10.3, our backups to these GlusterFS nodes completed after 2 days of running without encountering the issue again. We appreciate your help on this issue.

@eg-ops you should consider the same 10.3 downgrade. It does seem to be an issue in 10.4 that does not affect 10.3, based on the testing we just did.

@FleloShe

FleloShe commented May 22, 2023

Hi there, we are currently experiencing the same issue with 10.4. Unfortunately we can't find the 10.3 packages for Ubuntu (specifically Ubuntu 18.04 Bionic). It would be awesome to get some hints on where to get the packages!

@xhernandez
Contributor

@brandonshoemakerTH @eg-ops @FleloShe do you create hard-linked files in the Gluster volume that get deleted (at least one of the hard links) regularly?

@brandonshoemakerTH
Author

@xhernandez no hard links are used by us.
@FleloShe sorry, I'm not so familiar with Gluster packages on Ubuntu.

For the last 2 weeks we have not seen the issue re-occur on 10.3.

@FleloShe

FleloShe commented May 26, 2023

@xhernandez in our case only one brick appears to be affected, because only 1 Gluster node out of 4 was updated from 10.2 to 10.4.
The related volume is only used for persisting data for a dockerized Redis instance. I can't really tell what Redis does there, but it appears it creates a dump file every X minutes, which should be absolutely doable for Gluster.

Edit:
Log from /var/log/glusterfs/bricks/glusterfs-myvolumename-vol.log
[2023-05-26 08:47:10.244980 +0000] E [MSGID: 115067] [server-rpc-fops_v2.c:1324:server4_writev_cbk] 0-myvolumename-vol-server: WRITE info [{frame=168085833}, {WRITEV_fd_no=0}, {uuid_utoa=00afcfe7-5701-418e-b8f8-ff1984032a68}, {client=CTX_ID:c70f43ca-2c20-41fa-b7e2-9786339b84fa-GRAPH_ID:0-PID:3542-HOST:myhostname-PC_NAME:myvolumename-vol-client-0-RECON_NO:-6}, {error-xlator=myvolumename-vol-posix}, {errno=28}, {error=No space left on device}]

@nikow

nikow commented Jun 4, 2023

Can I safely downgrade from 11.0 to 10.3, too?

I noticed that if I stop the volume and start it back up, it starts working again. Another thing is that I can increase the time before it locks up again by increasing the number of file descriptors.
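For anyone trying the same workaround, roughly (volume name is a placeholder, and the file-descriptor limit is my guess at what was being raised):

gluster volume stop <volname>    # brief outage for clients of this volume
gluster volume start <volname>
# and/or raise the open-file limit for glusterd, e.g. via "systemctl edit glusterd"
# adding:  [Service]
#          LimitNOFILE=1048576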

@mohit84
Contributor

mohit84 commented Jun 5, 2023

Can I safely downgrade from 11.0 to 10.3, too?

I noticed that if I stop the volume and start it back up, it starts working again. Another thing is that I can increase the time before it locks up again by increasing the number of file descriptors.

Yes, you can downgrade safely. Would it be possible for you to share the reproducer steps? We are not facing any issue on our daily regression test build server.

@ben-xo

ben-xo commented Jul 17, 2023

Following this issue as we have started to encounter it on 10.4 as well

@ufou

ufou commented Jul 17, 2023

Are there any sysctl or glusterfs values which could be tuned to help delay this error until a permanent fix is created?

@ufou

ufou commented Jul 17, 2023

I tried downgrading a single node (Ubuntu Jammy running 10.4 package from http://ppa.launchpad.net/gluster/glusterfs-10/ubuntu) after creating some 10.3 Ubuntu jammy packages - unfortunately after installing the 10.3 packages, although the glusterd process starts normally, the gluster brick processes fail to start:

[2023-07-17 11:58:48.847641 +0000] E [MSGID: 106005] [glusterd-utils.c:6917:glusterd_brick_start] 0-management: Unable to start brick server1:/media/storage

and brick logs:

[2023-07-17 11:58:48.773195 +0000] W [MSGID: 101095] [xlator.c:392:xlator_dynload] 0-xlator: DL open failed [{error=/usr/lib/x86_64-linux-gnu/glusterfs/10.3/xlator/protocol/server.so: undefined symbol: xdr_gfx_readdir_rsp}]
[2023-07-17 11:58:48.773216 +0000] E [MSGID: 101002] [graph.y:211:volume_type] 0-parser: Volume 'storage-server', line 133: type 'protocol/server' is not valid or not found on this machine
[2023-07-17 11:58:48.773242 +0000] E [MSGID: 101019] [graph.y:321:volume_end] 0-parser: "type" not specified for volume storage-server
[2023-07-17 11:58:48.773539 +0000] E [MSGID: 100026] [glusterfsd.c:2509:glusterfs_process_volfp] 0-: failed to construct the graph []

Should I try 10.2?

@ufou

ufou commented Jul 17, 2023

OK, ignore the last comment, I neglected to install all the supporting libs created by the build.sh script, so this now works to downgrade to 10.3:

dpkg -i libgfrpc0_10.3-ubuntu1~jammy1_amd64.deb libgfapi0_10.3-ubuntu1~jammy1_amd64.deb libgfchangelog0_10.3-ubuntu1~jammy1_amd64.deb glusterfs-client_10.3-ubuntu1~jammy1_amd64.deb glusterfs-common_10.3-ubuntu1~jammy1_amd64.deb glusterfs-server_10.3-ubuntu1~jammy1_amd64.deb libgfxdr0_10.3-ubuntu1~jammy1_amd64.deb  libglusterd0_10.3-ubuntu1~jammy1_amd64.deb libglusterfs0_10.3-ubuntu1~jammy1_amd64.deb libglusterfs-dev_10.3-ubuntu1~jammy1_amd64.deb

@sulphur

sulphur commented Aug 8, 2023

I encountered the same "error=No space left on device" issue, even though I had free space. However, in my case the partitions where the bricks are located had run out of inodes. I'm posting this here in case someone else experiences the same problem.
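Worth checking on each brick before blaming Gluster itself, for example (brick path is a placeholder):

df -i /path/to/brick   # IUse%/IFree show whether the brick filesystem is out of inodes
df -h /path/to/brick   # compare against the ordinary block-space view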

@NHellFire

Setting storage.reserve (I used 5GB) on each volume fixed this for me with 10.4.
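Roughly, per volume (volume name is a placeholder):

gluster volume set <volname> storage.reserve 5GB   # reserve an absolute amount instead of the default percentage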

@baskinsy

baskinsy commented Aug 22, 2023

I have the same issue on a single-brick distributed volume with 10.4. Stopping and starting the volume resolves it temporarily. Setting storage.reserve to 1GB didn't help in our case.

@baskinsy

baskinsy commented Oct 1, 2023

We are constantly hitting this issue on a single-brick distributed volume (the simplest type of volume and installation): no other nodes, only one node with one brick, no special settings, a typical installation according to the documentation. It works for some time after a stop-start and then the same thing happens again. This is getting very frustrating and makes GlusterFS unusable. Please provide packages to downgrade to 10.3.

@dubsalicious

Having hit this same issue, I've attempted the storage.reserve fix with no luck. I also attempted to downgrade to version 10.3 (using Debian's packages) and 10.1 (using the built-in Ubuntu packages), but in both cases the volume wouldn't start because of an "undefined symbol" error. In one case it was "mem_pools" and in the other it was "mem_pools_init".

@NHellFire

Setting storage.reserve (I used 5GB) on each volume fixed this for me with 10.4.

Update: That only fixed it temporarily. I'm now back to almost every write returning no space left, despite the least amount of free space in the cluster being 200GB. Setting storage.reserve is no longer making a difference. I've now upgraded all nodes to 11 and it's working again.

@AmineYagoub

This is the same issue on v11. Is there any tutorial on how to downgrade to v10.3 on Ubuntu 22.04?

@Arakmar
Contributor

Arakmar commented Oct 10, 2023

For those interested, I published fixed packages on my PPA for 22.04 and 20.04. It's based on official 10.4 packages plus the patch fixing the issue (8830f22). The upgrade should be automatic if you are using packages from the official PPA.
https://launchpad.net/~yoann-laissus/+archive/ubuntu/gluster
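If you are on one of those Ubuntu releases, enabling it should be the usual PPA steps (PPA name inferred from the URL above, so double-check it):

sudo add-apt-repository ppa:yoann-laissus/gluster
sudo apt update
sudo apt install --only-upgrade glusterfs-server glusterfs-client glusterfs-common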

@dubsalicious

Will this fix be included in a Gluster 10.5 release? The tentative date for that release was 2 weeks ago, so it is possibly due imminently? I hope the PPA above helps people; unfortunately I think I'll have to wait for an official release.

@lwierzch

lwierzch commented Nov 1, 2023

My team is seeing the symptoms from this bug almost daily. Are there any updates on the release date?

@baskinsy

baskinsy commented Nov 6, 2023

For those interested, I published fixed packages on my PPA for 22.04 and 20.04. It's based on official 10.4 packages plus the patch fixing the issue (8830f22). The upgrade should be automatic if you are using packages from the official PPA. https://launchpad.net/~yoann-laissus/+archive/ubuntu/gluster

We can confirm that after installing the packages, restarting glusterd and a stop-start on the volume, the issue seems to have been resolved. Thank you.

@baskinsy

10.5 is released. I was not able to verify whether it includes the fix mentioned here, and we cannot test it on our system. It would be good if someone could share that info.

@mdetrano

I've done a little testing on 10.5 and it seems to be ok... by that I mean I ran a stress test script to write, read, and move files and let it run for a length of time where the same test would usually produce the "out of space" error. In 19 hours it didn't show any problems but that's just my basic test of a two node replication setup.

@Franco-Sparrow

Franco-Sparrow commented Dec 6, 2023

Hi, reactivating this issue.

I am using 10.4 and started to see this issue on a two-way distributed-replicated volume. Downgrading to 10.3 is not an option, as there are worse issues on that version related to brick disconnections, which pushed us to upgrade to 10.4 (that fixed most of them, and 10.5 fixes even more of these errors).

Can anyone confirm that 10.5 fixes this issue? For now, only @mdetrano has run some tests under 10.5 and it seems to be OK, but it would be nice if we help each other and share whether this 'no space left' issue was finally resolved.

Thanks in advance

@Arakmar
Contributor

Arakmar commented Dec 6, 2023

After several months on now 10.5 and also a custom 10.4 build (with 8830f22 which is included in 10.5), I can confirm the issue is definitely gone for us.

@Franco-Sparrow

After several months on now 10.5 and also a custom 10.4 build (with 8830f22 which is included in 10.5), I can confirm the issue is definitely gone for us.

Thanks Sir


Looking at the bug fixes in 10.5, it looks like the patch was included. Thanks also for your confirmation :)
