Docker swarm troubleshooting

Deepak Narayana Rao edited this page Oct 17, 2017 · 2 revisions

1. Service not starting or getting killed often

Run the command below on the docker swarm master to find the cause of the issue

docker service ps <service-name> --no-trunc

If you don't see any error above, you can check the logs of the service using the command below

docker service logs <service-name> --tail 200

Known Issue 1.1: Starting container failed: Address already in use

This typically means the port the service publishes is already bound on the host, often by a stale container or another process. Free the port or change the published port for the service.

Known Issue 1.2: Exited (137)

Exit code 137 (128 + 9) means the container was killed with SIGKILL (signal 9). A common cause is the container being killed due to an OutOfMemory error when it exceeds its memory limit.
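The 128-plus-signal convention can be verified in a plain shell, independent of docker: a process killed by SIGKILL exits with status 128 + 9 = 137.

```shell
# A process killed with SIGKILL (signal 9) reports exit status 128 + 9 = 137,
# the same code docker shows for an OOM-killed container.
sh -c 'kill -9 $$'        # the child shell kills itself with SIGKILL
echo "exit code: $?"      # prints: exit code: 137
```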

  • Docker swarm issue: https://github.com/moby/moby/issues/21083#issuecomment-239578836
  • Remediation:
    • Confirm it is a memory issue by looking at the memory usage metrics for this service's container in the Grafana dashboard
    • If it is a memory issue, increase the values of reservation_memory and limit_memory for this service in the deployment variables
    resources:
      reservations:
        memory: <new_value_for_reservation_memory>
      limits:
        memory: <new_value_for_limit_memory>
    
    • Ensure your application heap size is around one-third to one-half of the reserved memory; applications need memory beyond the heap for metadata and other resources
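Put together, the deploy-time memory settings sit under a service's deploy key in the compose file. This is a hypothetical sketch, not this repository's actual file: the service name, image, and sizes are placeholders, and the JAVA_OPTS heap setting follows the heap-sizing guidance above.

```yaml
# Hypothetical compose fragment; names and sizes are placeholders.
services:
  my-service:
    image: my-service:latest
    environment:
      # keep the JVM heap at roughly 1/3 to 1/2 of the reservation
      JAVA_OPTS: "-Xmx512m"
    deploy:
      resources:
        reservations:
          memory: 1024M   # new_value_for_reservation_memory
        limits:
          memory: 1536M   # new_value_for_limit_memory
```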

Known Issue 1.3: task: non-zero exit (137): dockerexec: unhealthy container

This occurs when the container health check fails. Docker swarm stops the unhealthy container and launches a new one until it is healthy. Check the logs for this service in Kibana to understand the root cause, and fix the health check endpoint in the service.

  • Remediation:
    • If the health check timeout is too small, increase the timeout accordingly in the deployment scripts
    • If the health check is failing due to a failing upstream without which this service can't work, fix the upstream service issue
    • If the health check is failing due to a failing upstream without which this service can work, ensure the health check endpoint doesn't fail because of the failing upstream. Add a timeout to external service calls so they don't block indefinitely
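For reference, health check timing is tuned per service in the compose file. The endpoint and values below are placeholders for illustration, not the actual settings of any service here:

```yaml
# Hypothetical compose fragment; endpoint and timings are placeholders.
services:
  my-service:
    image: my-service:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s     # how often the check runs
      timeout: 10s      # increase this if the check times out under load
      retries: 3        # consecutive failures before the container is marked unhealthy
      start_period: 60s # grace period before failures count (compose file v3.4+)
```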

2. Docker swarm worker node is down

The docker worker node shows as Down when you execute docker node ls.

Remediation: SSH into the agent node shown as down and run sudo service docker restart

If you are unable to SSH into the server, restart the server from the Azure portal.

If restarting the docker service doesn't resolve the issue, you need to make this docker agent rejoin as a worker node. Follow the steps below

  • SSH into the docker swarm master, run docker swarm join-token worker, and copy the output, which looks like docker swarm join --token <token> <master_address>
  • SSH into the docker swarm worker node and run the command copied above: docker swarm join --token <token> <master_address>
  • If you get an error as shown below
ops@swarmm-agentpublic-18950373000009:~$ docker swarm join --token SWMTKN-1-41nve2rhdm8rpa3dp93567zkaas0h94y807e6j7n8d8utzu35s-4385lo005rdp3qs395hc4ul3m 172.16.0.5:2377
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
ops@swarmm-agentpublic-18950373000009:~$ docker swarm leave
Error response from daemon: context deadline exceeded

It could be due to the issue below

Known Issue 2.1: Unable to join swarm

  • Docker swarm issue: https://github.com/moby/moby/issues/25432#issuecomment-303414091
  • Remediation:
    • SSH into worker node and run
    sudo cp -r /var/lib/docker/swarm /tmp/swarm-backup
    sudo service docker stop
    sudo rm -rf /var/lib/docker/swarm
    sudo service docker start
    
    • SSH into the swarm master node and execute docker node ls. You will see two entries listed for the worker node, one of which shows as Down. Copy the ID of the node shown as Down and run docker node rm <ID>

3. Jenkins job fails while connecting to Jenkins slave

For the error

Cannot contact <slave-name>: 
hudson.remoting.RequestAbortedException:
java.nio.channels.ClosedChannelException

Remediation: Re-trigger the job

4. Containers within same network are unable to communicate

Eliminate the possibility of a configuration issue before trying the remediation steps below

  • Docker Swarm Issue: https://github.com/docker/swarm/issues/2161
  • Remediation:
    • Restart the container of the service by executing
     docker service scale <service-name>=0
     docker service scale <service-name>=<expected-replication>
    