debian systemd services don't stop volumes and daemons properly #1767

Closed
Legogris opened this issue Nov 8, 2020 · 12 comments

Labels
wontfix Managed by stale[bot]

Comments

@Legogris

Legogris commented Nov 8, 2020

Description of problem:

The service files provided with the Debian packages don't stop all gluster processes when the services are stopped. This makes it impossible to stop volumes cleanly via systemd and leaves orphaned gluster processes that have to be killed manually.

I have noticed this for the past several versions, but I don't think it was the case for 5.x.

The exact command to reproduce the issue:

root@server1:/home/user# systemctl | grep gluster
  glusterd.service                                                                                      loaded active running   GlusterFS, a clustered file-system server
  glustereventsd.service                                                                                loaded active running   Gluster Events Notifier
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
     Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2020-11-06 09:06:44 UTC; 1 day 17h ago
       Docs: man:glusterd(8)
    Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1046 (glusterd)
      Tasks: 158 (limit: 4661)
     Memory: 3.1G
     CGroup: /system.slice/glusterd.service
             ├─1046 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
             ├─1178 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick >
             ├─1214 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name >
             ├─1261 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
             ├─1493 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
             └─1609 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6

Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.

● glustereventsd.service - Gluster Events Notifier
     Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2020-11-06 09:06:34 UTC; 1 day 17h ago
       Docs: man:glustereventsd(8)
   Main PID: 1034 (glustereventsd)
      Tasks: 4 (limit: 4661)
     Memory: 11.8M
     CGroup: /system.slice/glustereventsd.service
             ├─1034 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
             └─1692 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid

Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
root@server1:/home/user# systemctl stop glusterd.service glustereventsd.service
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
     Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 5s ago
       Docs: man:glusterd(8)
    Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1046 (code=exited, status=15)

Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.
Nov 08 02:47:09 server1 systemd[1]: Stopping GlusterFS, a clustered file-system server...
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1178 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1214 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1261 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1493 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1609 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: Stopped GlusterFS, a clustered file-system server.

● glustereventsd.service - Gluster Events Notifier
     Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 6s ago
       Docs: man:glustereventsd(8)
    Process: 1034 ExecStart=/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid (code=killed, signal=TERM)
   Main PID: 1034 (code=killed, signal=TERM)

Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
Nov 08 02:47:09 server1 systemd[1]: Stopping Gluster Events Notifier...
Nov 08 02:47:09 server1 systemd[1]: glustereventsd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: Stopped Gluster Events Notifier.
root@server1:/home/user# ps -Af | grep gluster
root        1178       1  5 Nov06 ?        02:15:15 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49152 --global-threading --xlator-option volume2-server.listen-port=49152
root        1214       1  6 Nov06 ?        02:51:54 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49153 --xlator-option volume1-server.listen-port=49153
root        1261       1  0 Nov06 ?        00:17:40 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
root        1493       1  0 Nov06 ?        00:00:15 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
root        1609       1  0 Nov06 ?        00:08:57 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6
root       96093   95946  0 02:47 pts/1    00:00:00 grep gluster

Expected results:
systemctl stop glusterd.service should stop all volumes and processes, including the bitrot and self-heal daemons. I could also see it making sense to have the self-heal and bitrot daemons as separate services, but regardless there should be a way to reliably stop any systemd-started gluster process via systemctl.

The operating system / glusterfs version:

Debian 10 (buster) / Debian 11 (bullseye)

glusterfs 8.2-1. Also true for 8.0, 8.1, and I think 7.x. It was not the case for 5.x IIRC.

@Legogris Legogris changed the title debian systemd service files don't stop volumes properly debian systemd service files don't stop volumes and daemons properly Nov 8, 2020
@Legogris Legogris changed the title debian systemd service files don't stop volumes and daemons properly debian systemd services don't stop volumes and daemons properly Nov 8, 2020
@jronnblom

I've seen something like that.

It looks like the processes are sent SIGKILL first rather than SIGTERM by systemd. Maybe glusterd.service needs some updates.

There is a script, /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh, that can be used to shut down gluster.
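
If you want systemctl stop glusterd to run that script automatically, one possible (untested) sketch is a drop-in pointing ExecStop at the script path above:

# /etc/systemd/system/glusterd.service.d/stop-all.conf  (sketch, not shipped by any package)
[Service]
ExecStop=/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh

Run systemctl daemon-reload afterwards. Be aware the script stops every gluster process on the node, including bricks, so clients connected to bricks on this node will lose access.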

@stale

stale bot commented Jun 18, 2021

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Jun 18, 2021
@3nprob

3nprob commented Jun 19, 2021

Still an issue

@stale stale bot removed the wontfix Managed by stale[bot] label Jun 19, 2021
@aravindavk
Member

It is intentional not to stop all the processes when glusterd is stopped. If the bricks are up, already connected clients/mounts continue to work even if glusterd goes down. Think about restarting glusterd to fix an issue or a memory leak; that doesn't mean all the other services should be stopped.

For now you can use the script that @jronnblom suggested in #1767 (comment).
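
To see this behaviour for yourself (assuming the stock unit with KillMode=process), the brick PIDs should survive a glusterd restart:

systemctl restart glusterd.service
pgrep -af glusterfsd    # brick processes keep running with their old PIDs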

@3nprob

3nprob commented Jun 19, 2021

@aravindavk The general contract should be that whatever is brought up by systemctl start is also brought down by systemctl stop.

As there is currently no reload implemented for glusterd.service, perhaps the scenario you describe could be addressed by reload rather than restart?

The alternative would be splitting the services, either into a general glusterd-bricks/glusterd-volumes unit or a glusterd-brick@foobar template (rough sketch below).
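
To illustrate the glusterd-brick@foobar idea, a purely hypothetical template unit could look roughly like this. Nothing like it ships today, and note that gluster volume stop acts cluster-wide, so a real per-node split would have to manage the local glusterfsd processes instead; this only shows the shape of the interface:

# /etc/systemd/system/glusterd-brick@.service  (hypothetical sketch, %i = volume name)
[Unit]
Description=GlusterFS bricks for volume %i
Requires=glusterd.service
After=glusterd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/gluster --mode=script volume start %i force
ExecStop=/usr/sbin/gluster --mode=script volume stop %i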

@stale

stale bot commented Jan 15, 2022

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Jan 15, 2022
@3nprob

3nprob commented Jan 15, 2022

Still an issue

@stale stale bot removed the wontfix Managed by stale[bot] label Jan 15, 2022
@stale

stale bot commented Sep 21, 2022

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Sep 21, 2022
@stale

stale bot commented Oct 22, 2022

Closing this issue as there has been no update since my last update. If this issue is still valid, feel free to reopen it.

@stale stale bot closed this as completed Oct 22, 2022
@ronnyadsetts

This is still an issue.

When a node is shut down or rebooted, all gluster volumes with bricks on the affected server hang for the default 42-second timeout. This is entirely avoidable by properly stopping all gluster processes during shutdown.

I understand the logic behind the glusterd.service 'restart' behaviour, but it gets really annoying when all my VMs go unresponsive because an overheating event triggered a graceful server shutdown.
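
For reference, that 42-second hang is GlusterFS's network.ping-timeout default. Until the shutdown behaviour is fixed, it can be tuned per volume to soften the impact (at the cost of clients reacting to short network blips); myvol below is just a placeholder name:

gluster volume get myvol network.ping-timeout    # default: 42
gluster volume set myvol network.ping-timeout 10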

@ethaniel

Still an issue in 9.4.

@jetibest

jetibest commented Jun 4, 2023

I have the same issue. On Debian there are no separate glusterfsd service instances, only glusterd, which spawns multiple child processes that are not killed when glusterd is stopped. This is because of KillMode=process in /lib/systemd/system/glusterd.service.

To resolve this issue, I am using the following:

/etc/systemd/system/glusterd.service.d/override.conf:

[Service]
KillMode=control-group

Use systemctl daemon-reload to apply the changes.
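
If it helps, the drop-in can also be created and verified from the CLI; the systemctl edit route should be equivalent to writing the file by hand:

systemctl edit glusterd.service                  # opens the override.conf drop-in in an editor
systemctl daemon-reload
systemctl show -p KillMode glusterd.service      # should now report KillMode=control-group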

Remember what man systemd.kill says about this:

Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

Either KillMode should be set to control-group, or there should be separate glusterfsd services. I remember reading an issue where glusterfs refused to mount because the address (port) was already in use; it was resolved after a reboot. I had the same issue, and then found out it was caused by these glusterfsd processes lingering after a restart of the glusterd service.
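
For that address-already-in-use symptom, a quick way to spot leftover brick processes still holding their ports (49152 and up in the output above) before starting glusterd again:

pgrep -af gluster                 # any leftover glusterfsd/glusterfs processes
ss -tlnp | grep gluster           # brick ports still bound by old PIDs
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh   # clean them up if needed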
