Authentik worker becomes "unhealthy" and never recovers after restarting Redis Docker container #6221

Closed
freender opened this issue Jul 11, 2023 · 7 comments
Labels
bug Something isn't working wontfix


freender commented Jul 11, 2023

Describe the bug
The authentik worker becomes "unhealthy" and never recovers after the Redis Docker container is restarted.

To Reproduce
Steps to reproduce the behavior:

  1. Check that the authentik worker is up and running:
docker inspect auth-worker | grep Status

Actual result = expected result:

    "Status": "running",
    "Status": "healthy",
  2. Restart the Redis Docker container.
  3. Run the worker healthcheck:
docker exec auth-worker /lifecycle/ak healthcheck

Actual result:
The worker has lost its Redis connection, and the only fix is to restart the authentik worker.

root@NAS:~# docker exec auth-worker /lifecycle/ak healthcheck
{"event":"checking health","level":"debug","mode":"worker","timestamp":"2023-07-11T14:26:48-04:00"}
{"delta":104.817957282,"event":"Worker hasn't updated heartbeat in threshold","level":"warning","threshold":30,"timestamp":"2023-07-11T14:26:48-04:00"}
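For readers unfamiliar with the log format: the warning above reports that the worker's last heartbeat is about 105 seconds old (`delta`) against a limit of 30 seconds (`threshold`). The following is a minimal sketch of that comparison, parsing the JSON lines shown above. The function name `heartbeat_is_stale` is hypothetical, and this only illustrates how to read the output; it is not authentik's actual healthcheck code.

```python
import json


def heartbeat_is_stale(log_output: str) -> bool:
    """Return True if any JSON log line reports a delta above its threshold."""
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if "delta" in record and "threshold" in record:
            if record["delta"] > record["threshold"]:
                return True
    return False


# The two log lines from the report above, copied verbatim.
sample = '''
{"event":"checking health","level":"debug","mode":"worker","timestamp":"2023-07-11T14:26:48-04:00"}
{"delta":104.817957282,"event":"Worker hasn't updated heartbeat in threshold","level":"warning","threshold":30,"timestamp":"2023-07-11T14:26:48-04:00"}
'''

print(heartbeat_is_stale(sample))
```

A script like this could be fed the output of `docker exec auth-worker /lifecycle/ak healthcheck` to decide whether a restart is needed.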

Expected behavior
Please investigate whether such cases can be detected and recovered from automatically. For now, the only option is to restart the authentik worker.

Version and Deployment (please complete the following information):

  • authentik version: 2023.6.1
  • Deployment: docker 20.10.24, unraid
freender added the bug label on Jul 11, 2023
@a-gerhard
Contributor

I can confirm this. It seems like once the worker has successfully connected to redis, and then the redis connection is lost, the worker does not handle the resulting Exception (where it should be trying to re-establish the connection).

This is an issue for us because we use Watchtower to keep our containers up to date, so the Redis container is recreated regularly in our setup.

Solution: Either have the worker exit (and therefore restart) when the Redis connection becomes unavailable, or find a way to try to re-connect to redis if a connection loss is detected.
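The two options suggested above (reconnect, or exit so the container gets restarted) can be combined. Below is a rough sketch, not the worker's actual code: `call_with_reconnect`, `operation`, and `reconnect` are hypothetical names, and Python's built-in `ConnectionError` stands in for whatever exception the real Redis client raises. It retries with exponential backoff and exits with a non-zero status once the attempts are exhausted, so that a Docker restart policy can take over.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_reconnect(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`; on ConnectionError, back off, reconnect, and retry.

    Raises SystemExit(1) after `max_attempts` failed attempts so that a
    supervisor (for example a Docker restart policy) can restart the process.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                break
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
            try:
                reconnect()
            except ConnectionError:
                pass  # still down; the next attempt fails and backs off again
    raise SystemExit(1)
```

Whether exiting is acceptable depends on the deployment; it only helps if the worker service has a restart policy such as `restart: always`.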

@mgrimace

I'm experiencing the authentik-worker becoming unhealthy using image: ghcr.io/goauthentik/server:2023.6.1. There were no specific actions or changes; I just noticed it in Portainer.

@a-gerhard
Contributor

As a workaround until this is fixed, I have set up autoheal for the workers.

Add this to docker-compose.yml:

  autoheal:
    restart: always
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

and then add the autoheal label to your worker service:

    labels:
      autoheal: "true"
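For setups where adding another container is not an option, roughly the same effect can be sketched as a small script run on a schedule. This is an assumption-laden illustration, not how autoheal itself is implemented: `run_docker` and `restart_if_unhealthy` are hypothetical helpers, and `auth-worker` is the container name from the original report. It relies on `docker inspect --format` reading the container's health status, which only works for containers that define a healthcheck.

```python
import subprocess
from typing import Callable, Sequence


def run_docker(args: Sequence[str]) -> str:
    """Run a docker CLI command and return its trimmed stdout.

    Raises subprocess.CalledProcessError if docker exits non-zero, for
    example when the container has no healthcheck to inspect.
    """
    result = subprocess.run(
        ["docker", *args], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def restart_if_unhealthy(
    container: str,
    run: Callable[[Sequence[str]], str] = run_docker,
) -> bool:
    """Restart `container` if Docker reports it as unhealthy.

    Returns True when a restart was issued, False otherwise.
    """
    status = run(["inspect", "--format", "{{.State.Health.Status}}", container])
    if status == "unhealthy":
        run(["restart", container])
        return True
    return False
```

Calling `restart_if_unhealthy("auth-worker")` from a cron job or a sleep loop approximates the workaround above; the `run` parameter exists only so the logic can be exercised without a Docker daemon.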


mgrimace commented Sep 8, 2023

  autoheal:
    restart: always
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

Thank you for this. I noticed that others in this thread were also using Watchtower, so I tested adding the label:

    labels:
      com.centurylinklabs.watchtower.enable: "false"

to each service in the Authentik stack.

This also appears to have solved the issue for me (at least in the short term). Perhaps something to do with Watchtower attempting to update/restart(?) redis, which is not on a fixed version, while the worker remains on a fixed version. My knowledge is limited in this area, but hopefully another piece of the puzzle.

@authentik-automation
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

authentik-automation bot closed this as not planned on Nov 15, 2023
@mooglestiltzkin

I noticed this issue as well. Can we expect a permanent fix?

Yes, I also use Watchtower. I noticed a correlation: straight after receiving a Watchtower email notification about restarting authentik (probably because it was updating), the worker became unhealthy.

@keliansb

I'm also facing this issue, and I use Watchtower to update the database and Redis.
