Troubleshooting monitoring alerts

Alerts

backup_is_too_old

This alert indicates that no new backup has been uploaded to Azure Blob Storage for a while

Troubleshooting

  • Check that the Jenkins job *_Backup has run and completed successfully within the expected duration
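
The latest upload can also be verified directly against the storage account. A minimal sketch, assuming the Azure CLI is configured; the storage account and container names are placeholders:

    # List blobs with their last-modified timestamps to confirm a recent upload
    az storage blob list \
      --account-name <storage-account> \
      --container-name <backup-container> \
      --query "[].{name:name, lastModified:properties.lastModified}" \
      --output table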

backup_size_is_too_small

This alert indicates that the backup file uploaded to Azure Blob Storage is smaller than the expected size. This can happen if the backup script produces a dummy file due to a failure, as happened in the GitLab database incident

Troubleshooting

  • Check that the Jenkins job *_Backup has run and succeeded without errors or warnings, which might otherwise produce a dummy backup file
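
The uploaded file size can be checked directly as well. A quick sketch, assuming the Azure CLI is configured; the storage account and container names are placeholders:

    # List blobs with their sizes in bytes; an unusually small contentLength suggests a dummy file
    az storage blob list \
      --account-name <storage-account> \
      --container-name <backup-container> \
      --query "[].{name:name, sizeBytes:properties.contentLength}" \
      --output table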

docker_swarm_node_down

This alert indicates that one of the nodes in the Docker swarm is down. This could happen either if the server running Docker crashes (likely) or if the node loses connectivity to the swarm (less likely)

Troubleshooting

  • SSH into the Docker swarm manager and check which node is down by executing docker node ls. Try to SSH into the node that is down and see if it can ping the swarm manager
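
For example (the manager IP is a placeholder):

    # On the swarm manager: list nodes and check their STATUS / AVAILABILITY columns
    docker node ls

    # From the affected node, if SSH still works: check connectivity to the swarm manager
    ping -c 3 <swarm-manager-ip>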

Action Items

high_memory_usage_on_container

This alert indicates that a container is using over 90% of its memory. If it reaches the memory limit configured in Docker swarm, the container will be restarted due to an OutOfMemory error.

Troubleshooting

  • Open the Container Details dashboard in Grafana
  • Select the time range during which the alert was generated
  • Select the container of interest in all the graphs in this dashboard
  • The "Memory Usage" and "Memory Usage %" graphs show memory usage over time (a command-line snapshot is also sketched below)

Action Items

  • If this is unexpected behaviour for this application, debug the application for memory leaks
  • If high memory usage is expected across all environments, increase the default memory limit for this service in the ansible role defaults/main.yml in the public repo
  • If high memory usage is expected only in a specific environment, increase the memory limit for this service in the inventory group_vars of that environment (a temporary in-place override is sketched below)
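
As a temporary mitigation while the ansible change is rolled out, the limit can also be raised on the running service. A sketch; the service name and new limit are placeholders, and this change will be overwritten by the next ansible deployment:

    # Raise the memory limit in place (triggers a rolling update of the service's tasks)
    docker service update --limit-memory 1G <service-name>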

monitoring_service_down

This alert indicates that the exporter configured for scraping metrics is not reachable by Prometheus. This usually means the exporter is not running due to an issue

Troubleshooting
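
A minimal manual check, assuming the exporter's host and port are known (placeholders below); most Prometheus exporters serve metrics on a /metrics endpoint:

    # Verify the exporter's metrics endpoint responds from the Prometheus host
    curl -sf --max-time 5 http://<exporter-host>:<exporter-port>/metrics | head

    # Check whether the exporter service has running tasks (run on a swarm manager)
    docker service ps <exporter-service-name>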

service_down

This alert indicates that a service is down because the URL of the service or the TCP port of the service is not reachable

Troubleshooting

  • Open the Availability dashboard in Grafana
  • Select the time range during which the alert was generated
  • Select the URL of interest in all the graphs in this dashboard
  • The "Availability" graph shows up (value 1) and down (value 0) status for the URL
  • The "Probe Time" graph shows the time taken to access the URL
  • The "Probe Status Code" graph shows the HTTP status code for the URL. Status code 0 is returned when the URL is unreachable or the connection times out after 5 seconds (the configured timeout value for URLs)
  • Search for logs of this service in Kibana -> Discover -> Search program: "*<service-name>*". Check if there are any errors related to the service. Example search query: program: "*learner-service*"
  • If the logs do not have enough details, check the service info by following the steps in Docker swarm management commands (a manual probe is also sketched below)
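
A manual probe from the command line can confirm what the availability check is seeing. A sketch, with the service URL as a placeholder and the same 5 second timeout as the configured probe:

    # Print the HTTP status code and total response time for the URL
    curl -s -o /dev/null --max-time 5 -w "%{http_code} %{time_total}s\n" <service-url>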

Action Items

service_replication_failure

This alert indicates that Docker swarm is not able to launch all replicas for this service

Troubleshooting

  • Check the service info by following the steps in Docker swarm management commands (see the example commands below)
  • Search for logs of this service in Kibana -> Discover -> Search program: "*<service-name>*". Check if there are any errors related to the service. Example search query: program: "*learner-service*"
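
For example, using standard Docker swarm commands on a manager node (the service name is a placeholder):

    # Compare desired vs running replica counts across services
    docker service ls

    # Show the task history for the affected service, including full error messages for failed tasks
    docker service ps <service-name> --no-trunc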

Action Items

service_flapping

This alert indicates that the service status has been flapping between up and down for the specified duration

Troubleshooting

Same as service_down

too_many_server_side_http_errors

This alert indicates that too many responses with 5xx HTTP status codes (server side errors) were logged in the proxy (nginx) logs

Troubleshooting

  • Search for proxy logs in Kibana -> Discover -> Search program: *proxy* AND (message: "HTTP/1.1 500" OR message: "HTTP/1.1 503"). In the log messages, check the request paths for which these errors are generated (a command-line alternative is sketched below)
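
If Kibana is not reachable, the same information can be pulled from the nginx access log on the proxy node. A sketch, assuming the default combined log format and a hypothetical log path:

    # Count 5xx responses per request path (field 9 is the status code, field 7 the path)
    awk '$9 ~ /^5/ {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head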