
Alert if docker container stops/dies #1504

Closed
mokshpooja opened this Issue Mar 24, 2016 · 27 comments

mokshpooja commented Mar 24, 2016

Hi guys,
I hope you can guide me. I have my setup on AWS where I am trying to monitor several containers using cAdvisor + Prometheus + Alertmanager. What I want to do is launch an email alert (with the service/container name) if a container goes down for some reason. The problem is that if a container dies, cAdvisor no longer collects any metrics for it, so any query returns "no data" since there are no matching series.
E.g. container_cpu_usage_seconds_total{com_docker_compose_service="service1"} <= 0
would not work since there is no data to match against.
Is there a workaround to fire an alert if a container dies?

brian-brazil commented Mar 24, 2016

There's no easy way to do this. Rather than asking "did a container die?", it's better to ask "do I have enough containers running?" or "is my latency acceptable?", as a dead container doesn't automatically mean that there's any user impact or that human involvement is required. This can be done by aggregating on up, plus another alert using absent() in case they all die.

mokshpooja commented Mar 24, 2016

@brian-brazil is there an example of the aggregation on up and the additional alert?
I am really trying to understand how the different metrics and rules work. Some examples would really help!
Thanks

brian-brazil commented Mar 24, 2016

`sum(up{job="myjob"}) < 1234` and `absent(up{job="myjob"})` would be the basic forms.
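
For reference, wired into the alerting rule syntax of that era, these might look roughly like the following sketch; the job name myjob, the expected instance count 1234, the FOR durations, and the alert names are all placeholders:

```
# Fires when fewer instances of the job are up than expected.
ALERT MyJobInsufficientInstances
  IF sum(up{job="myjob"}) < 1234
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Fewer instances of myjob are up than expected"
  }

# Catches the case where every instance is gone, so the sum above has nothing to work on.
ALERT MyJobAbsent
  IF absent(up{job="myjob"})
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "No up series for job myjob at all"
  }
```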

trompx commented Mar 29, 2016

Hello,

I have the same problem and was wondering whether, in the case of no data, it would at least be possible to have the instance label returned.

I have an inhibit rule that is supposed to mute all alerts for down containers when cAdvisor is down, based on the criterion equal: ['instance'], so it doesn't work in this case as no labels are returned.
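
For context, an inhibit rule of the kind described might look roughly like this in alertmanager.yml; the alert names CadvisorDown and ContainerDown are hypothetical placeholders, and the rule can only take effect when both alerts actually carry an instance label:

```yaml
inhibit_rules:
  # Mute per-container alerts on an instance while the cAdvisor-down alert
  # for that same instance is firing.
  - source_match:
      alertname: CadvisorDown
    target_match:
      alertname: ContainerDown
    equal: ['instance']
```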

brian-brazil commented Mar 29, 2016

> I have the same problem and was wondering whether, in the case of no data, it would at least be possible to have the instance label returned.

If there's no data, we can't return anything as there's nothing to work off.

mokshpooja commented Mar 29, 2016

@brian-brazil is there any description of how up works?
My understanding is that up works per job, i.e. by checking the output from a target IP/port. In my case the target is the output of the cAdvisor:8080 container.
This means that if I could create a job/target for each container:port I want to monitor, I would be able to use your suggestion above: "sum(up{job="myjob"}) < 1234 and absent(up{job="myjob"}) would be the basic forms."

My experiment: I tried creating jobs for two of my containers, kibana and elasticsearch, in prometheus.yml (snippet below):
```yaml
scrape_configs:
  - job_name: 'cAdvisor_job'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['cadvisor:8080']
        labels:
          group: 'cAdvisor1'
  - job_name: 'kibana_job'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['kibana:8899']
        labels:
          group: 'kibana1'
  - job_name: 'elasticsearch_job'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['elasticsearch:9200']
        labels:
          group: 'elasticsearch1'
```

but then the value of up is 0 for both of these, since kibana and elasticsearch expose no metrics output. Please see the picture below:
[screenshot: Prometheus expression browser showing up = 0 for the kibana and elasticsearch targets]

So I am still unsure how to use 'up{job="myjob"}' since the only job that returns value=1 is cAdvisor :(

brian-brazil commented Mar 29, 2016

up only works for things that you can find via service discovery and that you can scrape. This is a fundamental limitation of bottom-up service discovery, as you need to maintain a separate source of truth to know what's meant to be running.

mokshpooja commented Mar 29, 2016

@brian-brazil "you need to maintain a separate source of truth to know what's meant to be running" that would be my docker-compose.yml
So, what I need is something that crawls the docker-compose.yml every now and then and pings the different containers and then put that data (success/failure of ping by container name) into prometheus as a target/job. What do you think?

brian-brazil commented Mar 29, 2016

A better way would be to auto-generate alerts using absent based on your configuration management, but this is all getting very complex. You should look at monitoring services, not individual containers.

mokshpooja commented Mar 29, 2016

@brian-brazil This could have been solved easily if Prometheus could return null when no data matches a query, e.g. container_cpu_usage_seconds_total{com_docker_compose_service="elasticsearch"} = null.
I think that is also what @trompx wants.

brian-brazil commented Mar 29, 2016

It returns an empty result in that case; Prometheus has no notion of null. It sounds like you're looking for absent().
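
As a quick illustration of absent(), using the metric and label from earlier in the thread: the expression below returns an empty result while matching series exist, and a single-element vector with value 1 once they disappear, so it can be used directly as an alert expression. Note that the result only carries the labels given as equality matchers in the selector (here com_docker_compose_service), not instance or job.

```
absent(container_cpu_usage_seconds_total{com_docker_compose_service="elasticsearch"})
```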

mokshpooja commented Mar 29, 2016

@brian-brazil any examples/documentation on absent()? I would try to experiment with it if I can figure out how best to use it.

trompx commented Mar 29, 2016

In my case, I thought that since Prometheus pulls data from all the target groups (instances) of all the jobs, it would, when a value was missing, at least pass along the instance ip:port, as it still scrapes those instances and thus has that info.

Anyway, I am deploying my infra with Ansible, so I guess the best way is to dynamically generate the alerts/prometheus.yml files to have one job per type of container monitored (webserver/mysql/redis/kibana, etc.).

Edit: I misread mokshpooja's answer; up won't return anything since some containers expose nothing. Too bad.

Thank you for the info.

mokshpooja commented Apr 14, 2016

@trompx So here is how I am solving this, based on @brian-brazil's feedback.

For services that expose no metrics or are not a separate scrape target:

ALERT kibana_absent
  IF absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Service {{$labels.com_docker_compose_service}} absent",
    description = "No metrics from service {{$labels.com_docker_compose_service}} for more than 5 sec."
  }

For services that do expose metrics or are a scrape target:

ALERT cadvisor_absent
  IF up{instance="cadvisor:8080"} == 0
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Instance {{$labels.instance}} down",
    description = "Instance {{$labels.instance}} of job {{$labels.job}} has been down for more than 5 sec."
  }

fabxc added the kind/question label and removed the question label Apr 28, 2016

commarla commented May 24, 2016

Hi,

I am working on this.
I have this metric from the consul_exporter: consul_catalog_service_node_healthy{container=~"mailer-service"} = 1 when the container is present.
If I stop the container, I receive an alert with absent(consul_catalog_service_node_healthy{container=~"mailer-service"}), but it takes around 5 minutes for the absent function to return 1.

My exporter is scraped every 5 seconds.

Is there a way to reduce this time? How does this work?

Thanks,

brian-brazil commented May 24, 2016

That's due to staleness, see #398.
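
A possible way around the staleness wait, assuming the service stays registered in Consul and only its health check starts failing (so the exporter keeps exporting the series with value 0 rather than dropping it), is to compare the value directly instead of relying on absent():

```
consul_catalog_service_node_healthy{container=~"mailer-service"} == 0
```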

commarla commented May 24, 2016

Thanks @brian-brazil, I found -query.staleness-delta 5m0s to be the right option 👍

Stef3478 commented Jun 9, 2016

Hi,

I'm also working on this. When I use this:
absent(latency_avg_value[1s])
I get the error:
Error executing query: parse error at char 30: expected type vector in call to function "absent", got matrix
This happens even though the query returns no data when the exporter is not running and a single value when it is running; I get the error in both cases.
Any idea how I can solve this?
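
The parse error is because absent() expects an instant vector, while latency_avg_value[1s] is a range vector; dropping the range selector makes the expression valid:

```
absent(latency_avg_value)
```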

Stef3478 commented Jun 10, 2016

Never mind, I'm using up now and this is a better way to accomplish what I want.

Thanks
Stef

juliusv commented Jul 23, 2016

Seems like this is resolved. Closing.

juliusv closed this Jul 23, 2016

fuzzyami commented Nov 27, 2016

@commarla which value did you end up using for staleness-delta? AFAIK 5m0s is the default value. Brian recommended against setting it too low (here)

helletheone commented May 29, 2017

So there is, at this moment, no real solution to monitor, for example, 100 containers, right?

andrewhowdencom commented May 30, 2017

@helletheone if you're looking more broadly for a snapshot of whether ${X} containers are unavailable, kube-state-metrics does that in the metric kube_deployment_status_replicas_unavailable, or by comparing kube_replicaset_spec_replicas with kube_replicaset_status_ready_replicas.
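
For example, a minimal alert expression on the first of those metrics could be the one below; the > 0 threshold is a placeholder, and you would normally pair it with a FOR clause in the alert rule:

```
kube_deployment_status_replicas_unavailable > 0
```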

harakiri406 commented Aug 31, 2017

Just leaving this here to show how I did it:
We're on Amazon AWS; we have hosts that need to be up all the time and hosts that are immutable. The immutable ones have hostnames starting with "i-".
I wanted to be able to exclude these immutables when they die BUT at the same time have alerts when an exporter service goes down. So here's my host-down check:
count(up{instance!~"i-.*"}) by (instance) == count(up{instance!~"i-.*"} == 0) by (instance) and count(up{instance!~"i-.*"}) by (instance) != 1
And this is what I do for service-down checks, excluding alerts when all services on an immutable are down:
up == 0 unless on (instance) (count(up{instance=~"i-.*"}) by (instance) == count(up{instance=~"i-.*"} == 0) by (instance) and count(up{instance=~"i-.*"}) by (instance) != 1)

hiscal2015 commented Dec 13, 2017

@mokshpooja I have the same issue, what's your final solution?

occelebi commented Jul 18, 2018

Any improvement for monitoring multiple containers on a host in one rule?

sohel2020 commented Dec 13, 2018

@brian-brazil @mokshpooja

https://github.com/stefanprodan/dockprom/blob/master/prometheus/alert.rules#L45
Let's say that on a single server I have two Jenkins containers, named jenkins-kjfkfj7f and jenkins-5jkdjd.

So how can I write a single alert rule for these two containers, and send the right container name in the alert description?
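
One possible approach, sketched here under the assumption that cAdvisor's per-container metrics carry the container name in the name label (as in the dockprom rules linked above) and that exactly two Jenkins containers are expected: match both containers with a regex and alert on the count. In the Prometheus 2.x YAML rule format this might look like:

```yaml
groups:
  - name: jenkins
    rules:
      # Assumes container_memory_usage_bytes is scraped from cAdvisor and that
      # the expected number of Jenkins containers is 2.
      - alert: JenkinsContainersMissing
        expr: count(container_memory_usage_bytes{name=~"jenkins.*"}) < 2 or absent(container_memory_usage_bytes{name=~"jenkins.*"})
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Fewer than 2 Jenkins containers are running"
```

Note that once a container's series disappear, its name can no longer be recovered from the data itself, so a rule like this can only report how many containers are missing; getting the exact container name back requires a separate source of truth, as discussed earlier in the thread.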
