debian systemd services don't stop volumes and daemons properly #1767

Closed
Legogris opened this issue Nov 8, 2020 · 12 comments

Labels
wontfix Managed by stale[bot]

Comments

@Legogris

Legogris commented Nov 8, 2020

Description of problem:

The service files provided with the Debian packages don't stop all gluster processes when the services are stopped. This makes it impossible to stop volumes cleanly via systemd and leaves orphaned gluster processes that have to be killed manually.

I have noticed this for the past several versions, but I don't think it was the case for 5.x.

The exact command to reproduce the issue:

root@server1:/home/user# systemctl | grep gluster
  glusterd.service                                                                                      loaded active running   GlusterFS, a clustered file-system server
  glustereventsd.service                                                                                loaded active running   Gluster Events Notifier
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
     Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2020-11-06 09:06:44 UTC; 1 day 17h ago
       Docs: man:glusterd(8)
    Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1046 (glusterd)
      Tasks: 158 (limit: 4661)
     Memory: 3.1G
     CGroup: /system.slice/glusterd.service
             ├─1046 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
             ├─1178 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick >
             ├─1214 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name >
             ├─1261 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
             ├─1493 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
             └─1609 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6

Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.

● glustereventsd.service - Gluster Events Notifier
     Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2020-11-06 09:06:34 UTC; 1 day 17h ago
       Docs: man:glustereventsd(8)
   Main PID: 1034 (glustereventsd)
      Tasks: 4 (limit: 4661)
     Memory: 11.8M
     CGroup: /system.slice/glustereventsd.service
             ├─1034 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
             └─1692 /usr/bin/python3 /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid

Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
root@server1:/home/user# systemctl stop glusterd.service glustereventsd.service
root@server1:/home/user# systemctl status glusterd.service glustereventsd.service
● glusterd.service - GlusterFS, a clustered file-system server
     Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 5s ago
       Docs: man:glusterd(8)
    Process: 1032 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1046 (code=exited, status=15)

Nov 06 09:06:34 server1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 06 09:06:44 server1 systemd[1]: Started GlusterFS, a clustered file-system server.
Nov 08 02:47:09 server1 systemd[1]: Stopping GlusterFS, a clustered file-system server...
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1178 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1214 (glusterfsd) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1261 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1493 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: glusterd.service: Unit process 1609 (glusterfs) remains running after unit stopped.
Nov 08 02:47:09 server1 systemd[1]: Stopped GlusterFS, a clustered file-system server.

● glustereventsd.service - Gluster Events Notifier
     Loaded: loaded (/lib/systemd/system/glustereventsd.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2020-11-08 02:47:09 UTC; 6s ago
       Docs: man:glustereventsd(8)
    Process: 1034 ExecStart=/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid (code=killed, signal=TERM)
   Main PID: 1034 (code=killed, signal=TERM)

Nov 06 09:06:34 server1 systemd[1]: Started Gluster Events Notifier.
Nov 08 02:47:09 server1 systemd[1]: Stopping Gluster Events Notifier...
Nov 08 02:47:09 server1 systemd[1]: glustereventsd.service: Succeeded.
Nov 08 02:47:09 server1 systemd[1]: Stopped Gluster Events Notifier.
root@server1:/home/user# ps -Af | grep gluster
root        1178       1  5 Nov06 ?        02:15:15 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume2.server1.localdomain.data-brick1-b1 -p /var/run/gluster/vols/volume2/server1.localdomain-data-brick1-b1.pid -S /var/run/gluster/cef39469c59c165a.socket --brick-name /data/brick1/b1 -l /var/log/glusterfs/bricks/data-brick1-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49152 --global-threading --xlator-option volume2-server.listen-port=49152
root        1214       1  6 Nov06 ?        02:51:54 /usr/sbin/glusterfsd -s server1.localdomain --volfile-id volume1.server1.localdomain.data-brick2-b1 -p /var/run/gluster/vols/volume1/server1.localdomain-data-brick2-b1.pid -S /var/run/gluster/64afd89aabbe69d4.socket --brick-name /data/brick2/b1 -l /var/log/glusterfs/bricks/data-brick2-b1.log --xlator-option *-posix.glusterd-uuid=GUID1 --process-name brick --brick-port 49153 --xlator-option volume1-server.listen-port=49153
root        1261       1  0 Nov06 ?        00:17:40 /usr/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p /var/run/gluster/bitd/bitd.pid -l /var/log/glusterfs/bitd.log -S /var/run/gluster/9bbe88f3027a5730.socket --global-timer-wheel
root        1493       1  0 Nov06 ?        00:00:15 /usr/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/run/gluster/scrub/scrub.pid -l /var/log/glusterfs/scrub.log -S /var/run/gluster/775ff10403118051.socket --global-timer-wheel
root        1609       1  0 Nov06 ?        00:08:57 /usr/sbin/glusterfs -s localhost --volfile-id shd/volume2 -p /var/run/gluster/shd/volume2/volume2-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/143682d2ae48b0c0.socket --xlator-option *replicate*.node-uuid=GUID1 --process-name glustershd --client-pid=-6
root       96093   95946  0 02:47 pts/1    00:00:00 grep gluster

Expected results:
systemctl stop glusterd.service should stop all volumes and processes, including the bitrot and self-heal daemons. I could also see it making sense to have the self-heal and bitrot daemons as separate services, but regardless there should be a way to reliably stop any systemd-started gluster process via systemctl.

The operating system / glusterfs version:

Debian 10 (buster) / Debian 11 (bullseye)

glusterfs 8.2-1. Also true for 8.0, 8.1, and I think 7.x. It was not the case for 5.x IIRC.

@Legogris Legogris changed the title debian systemd service files don't stop volumes properly debian systemd service files don't stop volumes and daemons properly Nov 8, 2020
@Legogris Legogris changed the title debian systemd service files don't stop volumes and daemons properly debian systemd services don't stop volumes and daemons properly Nov 8, 2020
@jronnblom

I've seen something like that.

It looks like the processes are sent SIGKILL first rather than SIGTERM by systemd. Maybe glusterd.service needs some updates.

There is a script, /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh, that can be used to shut down gluster.
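
If you want systemctl stop glusterd to run that script automatically, one possible (untested) sketch is a drop-in pointing ExecStop at the script path above:

# /etc/systemd/system/glusterd.service.d/stop-all.conf  (sketch, not shipped by any package)
[Service]
ExecStop=/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh

Run systemctl daemon-reload afterwards. Be aware the script stops every gluster process on the node, including bricks, so clients connected to bricks on this node will lose access.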

@stale

stale bot commented Jun 18, 2021

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Jun 18, 2021
@3nprob

3nprob commented Jun 19, 2021

Still an issue

@stale stale bot removed the wontfix Managed by stale[bot] label Jun 19, 2021
@aravindavk
Member

It is intentional not to stop all the processes when glusterd is stopped. If the bricks are up, already connected clients/mounts continue to work even if glusterd goes down. Think about restarting glusterd to fix an issue or a memory leak; that doesn't mean all the other services should be stopped.

For now you can use the script that @jronnblom suggested in #1767 (comment).
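
To see this behaviour for yourself (assuming the stock unit with KillMode=process), the brick PIDs should survive a glusterd restart:

systemctl restart glusterd.service
pgrep -af glusterfsd    # brick processes keep running with their old PIDs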

@3nprob

3nprob commented Jun 19, 2021

@aravindavk The general contract should be that whatever is brought up by systemctl start is also brought down by systemctl stop.

As there is currently no reload implemented for glusterd.service, perhaps the scenario you describe could be addressed by reload rather than restart?

The alternative would be splitting the services, either into a general glusterd-bricks/glusterd-volumes unit or a glusterd-brick@foobar template (rough sketch below).
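
To illustrate the glusterd-brick@foobar idea, a purely hypothetical template unit could look roughly like this. Nothing like it ships today, and note that gluster volume stop acts cluster-wide, so a real per-node split would have to manage the local glusterfsd processes instead; this only shows the shape of the interface:

# /etc/systemd/system/glusterd-brick@.service  (hypothetical sketch, %i = volume name)
[Unit]
Description=GlusterFS bricks for volume %i
Requires=glusterd.service
After=glusterd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/gluster --mode=script volume start %i force
ExecStop=/usr/sbin/gluster --mode=script volume stop %i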

@stale

stale bot commented Jan 15, 2022

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Jan 15, 2022
@3nprob

3nprob commented Jan 15, 2022

Still an issue

@stale stale bot removed the wontfix Managed by stale[bot] label Jan 15, 2022
@stale

stale bot commented Sep 21, 2022

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Sep 21, 2022
@stale

stale bot commented Oct 22, 2022

Closing this issue as there has been no update since my last update. If this issue is still valid, feel free to reopen it.

@stale stale bot closed this as completed Oct 22, 2022
@ronnyadsetts

This is still an issue.

When a node is shut down or rebooted, all gluster volumes with bricks on the affected server hang for the default 42-second timeout. This is entirely avoidable by properly stopping all gluster processes during shutdown.

I understand the logic behind the glusterd.service 'restart' behaviour, but it gets really annoying when all my VMs go unresponsive because an overheating event triggered a graceful server shutdown.
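
For reference, that 42-second hang is GlusterFS's network.ping-timeout default. Until the shutdown behaviour is fixed, it can be tuned per volume to soften the impact (at the cost of clients reacting to short network blips); myvol below is just a placeholder name:

gluster volume get myvol network.ping-timeout    # default: 42
gluster volume set myvol network.ping-timeout 10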

@ethaniel

Still an issue in 9.4.

@jetibest

jetibest commented Jun 4, 2023

I have the same issue. On Debian there are no separate glusterfsd service instances, only glusterd, which spawns multiple child processes that are not killed when glusterd is stopped. This is because of KillMode=process in /lib/systemd/system/glusterd.service.

To resolve this issue, I am using the following:

/etc/systemd/system/glusterd.service.d/override.conf:

[Service]
KillMode=control-group

Use systemctl daemon-reload to apply the changes.
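
If it helps, the drop-in can also be created and verified from the CLI; the systemctl edit route should be equivalent to writing the file by hand:

systemctl edit glusterd.service                  # opens the override.conf drop-in in an editor
systemctl daemon-reload
systemctl show -p KillMode glusterd.service      # should now report KillMode=control-group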

Remember what man systemd.kill says about this:

Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

Either KillMode should be set to control-group, or there should be separate glusterfsd services. I remember reading an issue where glusterfs refused to mount because the address (port) was already in use; it was resolved after a reboot. I had the same issue, and then found out it was caused by these glusterfsd processes lingering after a restart of the glusterd service.
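
For that address-already-in-use symptom, a quick way to spot leftover brick processes still holding their ports (49152 and up in the output above) before starting glusterd again:

pgrep -af gluster                 # any leftover glusterfsd/glusterfs processes
ss -tlnp | grep gluster           # brick ports still bound by old PIDs
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh   # clean them up if needed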
