Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate metric events in Telemetry endpoint #536

Closed
eliasbrange opened this issue Dec 4, 2017 · 8 comments
Closed

Duplicate metric events in Telemetry endpoint #536

eliasbrange opened this issue Dec 4, 2017 · 8 comments

Comments

@eliasbrange
Copy link
Contributor

eliasbrange commented Dec 4, 2017

We use containerpilot -putmetric to gather certain fluctuating metrics such as memory usage and free disk space. We noticed that after running a container in production for some time the /metrics endpoint generates over 100k lines with lines on the form

containerpilot_events{code="Metric",source="metric|value"} 1

Since the metrics are fluctuating this means that we get a few new events every 15 seconds or so. My guess is that this will continue indefinitely until the container is restarted. For counters that can only ever go up this also means a new event line every time you do putmetric.

My question is, is this expected behaviour? Filtering in e.g. InfluxDB on source is now impossible, and the endpoint is getting slower and slower due to the massive amount of events that are returned.

Edit: I have experimented with version 3.4.2, 3.5.0 and 3.6.0 and they seem to behave the same.

@jwreagor
Copy link
Contributor

jwreagor commented Dec 5, 2017

Thanks for reporting this @eliasbrange.

Initially the behavior you're describing doesn't sound like what I'd expect on a Prom formatted endpoint. I would expect a single metric counter to increase monotonically and not output multiple lines or counters with a single event value.

Could you paste a snippet of everything else being logged around when this starts (I don't need all the lines, obviously)?

Is it just a single metric named metric|value? Are any other custom metrics performing this way, possibly named in another manner?

A test configuration/setup is always helpful as well and helps me debug quicker but I'm quite sure I can reproduce this. Sounds like I only need to push metrics into ContainerPilot.

@eliasbrange
Copy link
Contributor Author

I guess the config doesn't actually matter. And defining a metric in the config does not matter either I think. Anyways, here comes a little config.

{
  consul: "localhost:8500",
  logging: {
    level: "DEBUG",
    format: "default",
    output: "stdout"
  },
  telemetry: {
    port: 9090,
    interfaces: ["static:127.0.0.1"],
    metrics: [
      {
        namespace: "telemetry",
        subsystem: "test",
        name: "counter",
        help: "test",
        type: "counter"
      }
    ]
  }
}

Then I run the following commands with the following output:

$ curl --silent localhost:9090/metrics | grep telemetry_test_counter
> # HELP telemetry_test_counter test                                          
> # TYPE telemetry_test_counter counter                                       
> telemetry_test_counter 0                                                    

$ for i in {1..10}; do containerpilot -putmetric telemetry_test_counter=$i; done

$ bash-4.3# curl --silent localhost:9090/metrics | grep telemetry_test_counter
> containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|10"} 1   
> containerpilot_events{code="Metric",source="telemetry_test_counter|2"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|3"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|4"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|5"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|6"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|7"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|8"} 1    
> containerpilot_events{code="Metric",source="telemetry_test_counter|9"} 1    
> # HELP telemetry_test_counter test                                          
> # TYPE telemetry_test_counter counter                                       
> telemetry_test_counter 55                                                   

As you can see, each call to -putmetric with a unique value results in a new row on the /metrics endpoint.

@jwreagor
Copy link
Contributor

jwreagor commented Dec 6, 2017

Thank you, your configs definitely helped because I wasn't seeing the issue immediately until I tried your setup. So yeah, you've found a bug and possibly a race condition because hammering the endpoint using a for loop causes multiple events to be created while sending them slower increments an existing event.

Notice below that telemetry_test_counter|1 equals 2. I wouldn't have noticed that without the for loop!

$ curl --silent localhost:9090/metrics | grep telemetry_test_counter
containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 2
containerpilot_events{code="Metric",source="telemetry_test_counter|10"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|2"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|3"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|4"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|5"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|6"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|7"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|8"} 1
containerpilot_events{code="Metric",source="telemetry_test_counter|9"} 1
# HELP telemetry_test_counter test
# TYPE telemetry_test_counter counter
telemetry_test_counter 56

@jwreagor jwreagor added the bug label Dec 6, 2017
@jwreagor jwreagor changed the title Metric events Hammering telemetry endpoint duplicates counter Metric events Dec 6, 2017
@jwreagor
Copy link
Contributor

jwreagor commented Dec 6, 2017

Looks like everything is working fine except logging. We're logging each unique counter event by value. This is especially prevalent when counting in varying increments which I didn't notice before.

$ for i in {1..10}; do containerpilot -config ./containerpilot.json5 -putmetric "telemetry_test_counter=1"; done
$ curl -s 172.16.20.7:4141/metrics | grep something_test
# HELP telemetry_test_counter total number of somethings
# TYPE telemetry_test_counter counter
telemetry_test_counter 10
containerpilot_events{code="Metric",source="telemetry_test_counter|1"} 10

@jwreagor jwreagor added enhancement and removed bug labels Dec 6, 2017
@jwreagor jwreagor changed the title Hammering telemetry endpoint duplicates counter Metric events Duplicate metric events in Telemetry endpoint Dec 6, 2017
@jwreagor
Copy link
Contributor

jwreagor commented Dec 6, 2017

We're reporting events through our event bus Prometheus collector. It looks like I can remove the value portion of the event name when collecting Metric events but I'm not sure of the repercussions of that. I believe that's only used for reporting so that's one option.

Another option would be to stop recording Metric events all together since end users use their Prometheus counter values themselves. Why do we need duplicate information through the endpoint? Do we really need to know about each Metric event if the counter or gauge is what we're ultimately deriving value out of?

Any thoughts? Going to think about this.

@jwreagor
Copy link
Contributor

jwreagor commented Dec 6, 2017

After adding a single if statement around Metric collection, I'm feeling good about option 2 and removing Metric event collection all together. In hindsight if feels a bit overzealous. Also, I don't believe we had many consumers of the telemetry endpoint in that regard so it's probably safe to remove those events.

@eliasbrange
Copy link
Contributor Author

Thanks for the quick iterations on this one @cheapRoc. Much appreciated.

@jwreagor
Copy link
Contributor

jwreagor commented Dec 7, 2017

Just released in version 3.6.1, thanks again for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants