
nsqd: track stats of end-to-end message processing time #280

Merged
merged 1 commit into nsqio:master on Oct 29, 2013

Conversation

mynameisfiber
Contributor

As per #268, this pull request adds the ability to track statistics about the time it takes for messages to get .finished(). This is done by maintaining a probabilistic percentile calculation over each channel (using the perks package) and merging them when topic quantiles are requested.

This data is surfaced in:

  • The /stats call
  • statsd
  • admin interface

Still to do:

  • Before/after timings
  • Docs

Sample data from /stats call:

$ curl "localhost:4151/stats"
nsqd v0.2.23 (built w/go1.1.2)

[test           ] depth: 0     be-depth: 0     msgs: 591      e2e%: 13.5s, 13.5s, 683.9us, 375.1us, 269.6us
    [test_chan                ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 591      e2e%: 13.5s, 13.5s, 1.1ms, 394.7us, 268.8us
        [V2 muon:55712           ] state: 3 inflt: 0    rdy: 62   fin: 591      re-q: 0        msgs: 591      connected: 7m18s
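
As a rough, self-contained sketch of the per-channel / topic-merge idea described above (using a naive exact-percentile stand-in in place of the perks stream; the type and variable names are illustrative, not the actual nsqd identifiers):

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // latencyStream is a naive stand-in for a perks-style quantile stream:
    // Insert is called when a message is finished, Merge rolls channel streams
    // up into a topic view, and Query answers a /stats request.
    type latencyStream struct {
        samples []float64 // publish-to-FIN latency in nanoseconds
    }

    func (s *latencyStream) Insert(v float64)       { s.samples = append(s.samples, v) }
    func (s *latencyStream) Merge(o *latencyStream) { s.samples = append(s.samples, o.samples...) }

    func (s *latencyStream) Query(q float64) float64 {
        if len(s.samples) == 0 {
            return 0
        }
        sorted := append([]float64(nil), s.samples...)
        sort.Float64s(sorted)
        return sorted[int(q*float64(len(sorted)-1)+0.5)]
    }

    func main() {
        chanA, chanB := &latencyStream{}, &latencyStream{}

        // On FinishMessage: record now - msg.Timestamp for that channel.
        published := time.Now().Add(-250 * time.Millisecond).UnixNano()
        chanA.Insert(float64(time.Now().UnixNano() - published))
        chanB.Insert(float64(500 * time.Millisecond))

        // On a topic-level stats request: merge the channel streams.
        topic := &latencyStream{}
        topic.Merge(chanA)
        topic.Merge(chanB)
        fmt.Printf("p99 ~ %v\n", time.Duration(topic.Query(0.99)))
    }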

@davemarchevsky

You're a machine @mynameisfiber

@mynameisfiber
Contributor Author

@davemarchevsky just wait until the comments start flowing in

@mreiferson
Member

so we meet again trebek

@@ -348,13 +357,34 @@ func (c *Channel) TouchMessage(clientID int64, id nsq.MessageID) error {
return nil
}

func (c *Channel) processMessagePercentile() {
Member

I'm not sure we want to add another goroutine to the channel to manage this data.

There are a couple reasons...

  1. as written there's a race between some other goroutine that calls Exit() (which closes c.messagePercentilesChan) and any other goroutine that calls FinishMessage() (which sends on c.messagePercentilesChan). We've handled these cases in a variety of ways, but in short, it's a pain in the ass.
  2. the overhead. I really want to minimize overhead from this on the fast path.

Contributor Author

That's fine... we can also do it another way that maintains the timestamp of the last reset and just does a .Reset() on the quantile if it's too old.

I had originally done this but opted out because if a channel hasn't had a finished message in a while, the data will be old and un-reset. There are ways around this though, by having a custom interface into the quantile object.

Member

My suggestion is to literally inline it in FinishMessage()

Contributor Author

Exactly... originally it did:

    c.removeFromInFlightPQ(item)

    msgStartTime := item.Value.(*inFlightMessage).msg.Timestamp
    nowNano := time.Now().UnixNano()
    if msgStartTime > c.lastPercentileResetTime+timeToReset {
        c.messagePercentilesStream.Reset()
        c.lastPercentileResetTime = nowNano
    }
    c.messagePercentilesStream.Insert(nowNano - msgStartTime)

    return nil

@mreiferson
Member

by the way I laughed at the soccer in the face kid basically all day today

@mreiferson
Member

but this one is great too

d445c38e-3362-11e3-9ba3-924cc7ee57a9

@mynameisfiber
Contributor Author

hahaha, that's great. I've got a couple more gifs up my sleeve depending on how this pull request goes ;)

@mreiferson
Member

This is the 25th comment and it's going well so far... I estimate we'll be in the 50ish range before we've merged.

@mreiferson
Member

If @jehiah gets involved all bets are off.

@mreiferson
Member

If @ploxiln gets involved I'm leaving.

@mynameisfiber
Contributor Author

@mreiferson if this gets to 100 comments I'm leaving.... I think that encompasses either @jehiah or @ploxiln joining.

@ploxiln
Member

ploxiln commented Oct 22, 2013

does chiming in to say "hah don't worry about that" count as "getting involved"

@mreiferson
Member

FYI, if you want to compare before/after performance you can use bench_reader/bench_writer:

# setup
$ nsqd --mem-queue-size=1000000
$ bench_writer
# this creates the channel so the topic drain into the channel isn't part of the benchmark
$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
# this is the one that is interesting
$ bench_reader

@mynameisfiber
Contributor Author

@mreiferson some timings --

master:

(origin/master)$ bench_writer
2013/10/22 15:08:51 duration: 5.043290723s - 37.820mb/s - 198283.235ops/s - 5.043us/op
(origin/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(origin/master)$ bench_reader
2013/10/22 15:09:06 duration: 7.055659747s - 27.033mb/s - 141730.190ops/s - 7.056us/op

this branch:

(micha/master)$ bench_writer
2013/10/22 15:06:24 duration: 3.298285488s - 57.828mb/s - 303187.824ops/s - 3.298us/op
(micha/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(micha/master)$ bench_reader
2013/10/22 15:06:45 duration: 7.20925809s - 26.457mb/s - 138710.529ops/s - 7.209us/op

@mynameisfiber
Contributor Author

That being said, I've run the benchmarks several times and the values for both branches vary within the same range... I'd say that the average timings for this branch vs master are much closer than their variances.

@mreiferson
Member

That's good news. Does that afford us any opportunities for smoothing over the issues relating to how we're resetting?

@mreiferson
Member

can you paste the /stats output after a test run on your branch (curious how it looks with some real data)

statsdMemStats = flag.Bool("statsd-mem-stats", true, "toggle sending memory and GC stats to statsd")
statsdPrefix = flag.String("statsd-prefix", "nsq.%s", "prefix used for keys sent to statsd (%s for host replacement)")
messagePercentiles = util.FloatArray{}
messagePercentileResetTime = flag.Float64("message-percentile-reset-time", 600, "time in seconds between message percentile data flushes (default: 600)")
Member

this can be a flag.Duration (which is really convenient)
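
For reference, flag.Duration parses values like 600s or 10m directly into a time.Duration; a minimal illustration (the flag name here is only an example, not necessarily the final nsqd option):

    package main

    import (
        "flag"
        "fmt"
        "time"
    )

    // Example flag name; accepts e.g. -e2e-processing-latency-window-time=10m or =600s.
    var e2eWindow = flag.Duration("e2e-processing-latency-window-time",
        10*time.Minute, "calculate end-to-end latency quantiles for this window of time")

    func main() {
        flag.Parse()
        fmt.Println("window:", *e2eWindow)
    }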

@mynameisfiber
Contributor Author

$ curl "localhost:4151/stats"
nsqd v0.2.23-alpha (built w/go1.1.2)

[cluster_test1382469115] depth: 0     be-depth: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0
    [ch                       ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0

[sub_bench      ] depth: 0     be-depth: 0     msgs: 1000000  percentiles: 1m34.056548253s, 1m34.056548253s, 23.635964732s, 23.256514065s, 19.914634067s
    [ch                       ] depth: 800   be-depth: 0     inflt: 160  def: 0    re-q: 0     timeout: 0     msgs: 1000000  percentiles: 1m34.056548253s, 1m34.056548253s, 23.49301998s, 23.124245376s, 19.833569623s

[test           ] depth: 0     be-depth: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0
    [test_chan                ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 0        percentiles: 0, 0, 0, 0, 0

As for the smoothing, we can always roll between two quantiles and then merge them when results are requested. That would make the queries for the percentiles a bit more expensive, but since that's not per message I'm alright with that.

Thoughts?
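
A rough sketch of the two-rolling-quantiles idea (again using a naive stand-in stream; the names and window bookkeeping are illustrative): each stream covers half of the window, the stale one is cleared and becomes current when its half expires, and queries merge both halves.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // stream is a stand-in for a perks-style quantile stream.
    type stream struct{ samples []float64 }

    func (s *stream) Insert(v float64) { s.samples = append(s.samples, v) }
    func (s *stream) Reset()           { s.samples = s.samples[:0] }

    // rollingQuantile keeps two streams, each covering half of the window
    // (e.g. two 5 minute halves for a ~10 minute rolling view).
    type rollingQuantile struct {
        streams    [2]stream
        current    int
        lastSwap   time.Time
        halfWindow time.Duration
    }

    // Insert is the per-message fast path: a time check plus one insert.
    func (r *rollingQuantile) Insert(latencyNanos float64) {
        if time.Since(r.lastSwap) > r.halfWindow {
            r.current = 1 - r.current
            r.streams[r.current].Reset()
            r.lastSwap = time.Now()
        }
        r.streams[r.current].Insert(latencyNanos)
    }

    // Query merges both halves so the answer covers roughly a full window;
    // this is the more expensive path, but it only runs per stats request.
    func (r *rollingQuantile) Query(q float64) float64 {
        merged := append(append([]float64(nil), r.streams[0].samples...), r.streams[1].samples...)
        if len(merged) == 0 {
            return 0
        }
        sort.Float64s(merged)
        return merged[int(q*float64(len(merged)-1)+0.5)]
    }

    func main() {
        r := &rollingQuantile{halfWindow: 5 * time.Minute}
        r.Insert(float64(250 * time.Millisecond))
        fmt.Printf("p95 ~ %v\n", time.Duration(r.Query(0.95)))
    }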

@mreiferson
Member

that's still pretty unwieldy...

I suppose all you really want is secs, ms, us, so instead of:

percentiles: 1m34.056548253s, 1m34.056548253s, 23.49301998s, 23.124245376s, 19.833569623s

we would have

percentiles: 94.1s, 94.1s, 23.5s, 23.1s, 19.8s

and if we had a unit smaller than a second...

percentiles: 5s, 4s, 2s, 500ms, 885us

Maybe we roll our own?
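
If we do roll our own, a minimal sketch of scaling a duration down to the largest fitting unit of s/ms/us (the helper name and rounding below are illustrative):

    package main

    import (
        "fmt"
        "time"
    )

    // formatE2E renders a duration with one decimal in the largest fitting unit.
    func formatE2E(d time.Duration) string {
        switch {
        case d >= time.Second:
            return fmt.Sprintf("%.1fs", d.Seconds())
        case d >= time.Millisecond:
            return fmt.Sprintf("%.1fms", float64(d)/float64(time.Millisecond))
        default:
            return fmt.Sprintf("%.1fus", float64(d)/float64(time.Microsecond))
        }
    }

    func main() {
        fmt.Println(formatE2E(94056548253 * time.Nanosecond)) // 94.1s
        fmt.Println(formatE2E(500 * time.Millisecond))        // 500.0ms
        fmt.Println(formatE2E(885 * time.Microsecond))        // 885.0us
    }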

@mynameisfiber
Contributor Author

Benchmarks with 2 rolling quantiles:

(micha/master)$ bench_writer
2013/10/22 16:53:00 duration: 8.926523399s - 21.367mb/s - 112025.696ops/s - 8.927us/op
(micha/master))$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(micha/master)$ bench_reader
2013/10/22 16:53:33 duration: 20.909254323s - 9.122mb/s - 47825.713ops/s - 20.909us/op

(origin/master)$ bench_writer
2013/10/22 16:55:47 duration: 8.712874453s - 21.891mb/s - 114772.686ops/s - 8.713us/op
(origin/master)$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
{"status_code":200,"status_txt":"OK","data":null}
(origin/master)$ bench_reader
2013/10/22 16:56:08 duration: 14.900942757s - 12.800mb/s - 67109.848ops/s - 14.901us/op

The benchmarks are quite variable... I think my laptop is a bit overloaded at the moment.

@mreiferson
Member

this is what I get:

# before
2013/10/22 17:15:26 duration: 629.3911ms - 30.305mb/s - 158883.721ops/s - 6.294us/op

# after
2013/10/22 17:12:54 duration: 700.623892ms - 27.224mb/s - 142729.931ops/s - 7.006us/op

I should really write a little bench.sh script; you might be neglecting to rm -f *.dat (and restart nsqd) before you run a bench iteration...

@mynameisfiber
Contributor Author

New timing format:

$ curl "localhost:4151/stats"
nsqd v0.2.23-alpha (built w/go1.1.2)

[sub_bench      ] depth: 0     be-depth: 0     msgs: 1852000  percentiles: 91.0s, 91.0s, 90.4s, 89.2s, 35.8s
    [ch                       ] depth: 0     be-depth: 0     inflt: 210  def: 0    re-q: 0     timeout: 0     msgs: 1852000  percentiles: 91.0s, 91.0s, 90.4s, 89.2s, 35.6s

The last remaining question is how we should display this on the nsqadmin page. Since we are taking data from multiple nsqd's, I'm not sure what the best way to aggregate the data is. I'm thinking for each percentile we show the min/max values. Thoughts?

statsdAddress string
statsdPrefix string
statsdInterval time.Duration
e2eProcessingLatencyWindowTime time.Duration
Member

newline before for diff and group with var below

Contributor Author

k

@mreiferson
Member

could you elaborate re: the issues with adding graphs?

@mynameisfiber
Contributor Author

@mreiferson the main thing is just UI... we are already inserting a lot of data, so displaying it inline in the e2e cell would cramp things up. I could add another column, but then we'd probably go off screen. Any thoughts?

@mreiferson
Member

I expect this to be "VIP metrics"; I'd want them to be easily accessible...

screenshot 2013-10-24 15 15 18

and at the top of the topic and channel pages (showing aggregate across all producers) for each quantile

@mynameisfiber
Contributor Author

@mreiferson and should the graph show all available percentiles on one plot? multiple sparklines? toggle between them? here is the issue.

@mreiferson
Member

yea, I think for sparkline sized graphs... maybe just the aggregate textual values are better...

}
*a = append(*a, v)
}
sort.Sort(sort.Reverse(sort.Float64Slice(*a)))
Member

sort.Reverse doesn't work on go1.0.3 😦
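
For reference, one go1.0-compatible alternative is a tiny type with an inverted Less (sketch only):

    package main

    import (
        "fmt"
        "sort"
    )

    // descFloat64s sorts high-to-low without sort.Reverse (which is go1.1+).
    type descFloat64s []float64

    func (a descFloat64s) Len() int           { return len(a) }
    func (a descFloat64s) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
    func (a descFloat64s) Less(i, j int) bool { return a[i] > a[j] }

    func main() {
        vals := []float64{0.5, 0.99, 0.95}
        sort.Sort(descFloat64s(vals))
        fmt.Println(vals) // [0.99 0.95 0.5]
    }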

@ploxiln
Member

ploxiln commented Oct 25, 2013

we've got an epic going on over here

@mynameisfiber
Contributor Author

New look of nsqadmin with the graphs... thoughts?

nsqadmin

@mreiferson
Member

@mynameisfiber I played around with the table layout, I think it's an improvement:

screenshot 2013-10-28 22 18 16

I had to drop the sparklines, though... Lemme know what you think

Pick up the commit at mreiferson@d24d773

@mynameisfiber
Contributor Author

That looks awesome! Why did you need to drop the sparklines? We could still have one underneath for the highest percentile, with rowspan set to span the entire "e2e processing latency" section. Not sure how easy that'll be with go/html/template though.

airsnakes++

@mynameisfiber
Contributor Author

How I feel about this probabilistic approach working:

image

@mynameisfiber
Contributor Author

For the docs:

The end-to-end message latency is calculated using the probabilistic
percentile technique described in ["Effective Computation of Biased Quantiles
over Data Streams"](http://www.cs.rutgers.edu/~muthu/bquant.pdf).  We use
bmizerany's implementation of this algorithm in the
[perks](http://godoc.org/github.com/bmizerany/perks) library.

In order to only show quantile information for the past 10 minutes, we
internally keep two quantiles per channel, each storing 5 minutes' worth of
latency data.  Once a quantile has been used for 5 minutes, we clear the
other quantile and start inserting new data into it.  Since quantiles can be
merged, this gives us a coarse rolling window.

In addition, all channel quantiles are merged to get topic latency information.
This can be done directly when the channels are on the same nsqd host; however,
when data is being accumulated through lookupd we take averages of the quantile
values for each nsqd host.  In order to maintain some statistics regarding the
distribution of the latencies over multiple hosts, we provide the min/max
values in addition to the average.
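
For illustration, the lookupd-side aggregation described above could look roughly like this (min/avg/max of one quantile's value across hosts; the function name and sample values are made up):

    package main

    import "fmt"

    // aggregate reduces one quantile's per-host values (nanoseconds) to min/avg/max.
    func aggregate(hostValues []float64) (min, avg, max float64) {
        if len(hostValues) == 0 {
            return 0, 0, 0
        }
        min, max = hostValues[0], hostValues[0]
        var sum float64
        for _, v := range hostValues {
            if v < min {
                min = v
            }
            if v > max {
                max = v
            }
            sum += v
        }
        return min, sum / float64(len(hostValues)), max
    }

    func main() {
        // e.g. the p99 value reported by three different nsqd hosts
        min, avg, max := aggregate([]float64{1.2e9, 0.9e9, 2.1e9})
        fmt.Printf("p99 min=%.1fs avg=%.1fs max=%.1fs\n", min/1e9, avg/1e9, max/1e9)
    }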

@mynameisfiber
Contributor Author

Final state of nsqadmin:

nsqadmin

Mouseovers on sparklines reveal a preview of the full graph

This commit adds new statistics about how long individual messages take
from being published to being finished.  We maintain several percentile
values; the exact quantiles can be specified as runtime parameters to
nsqd.  In addition, this information is displayed quite snazzily in
nsqadmin.

@mreiferson
Member

👍 💯 🎆

mreiferson added a commit that referenced this pull request Oct 29, 2013
nsqd: track stats of end-to-end message processing time
@mreiferson mreiferson merged commit fa88eb6 into nsqio:master Oct 29, 2013
@mynameisfiber
Contributor Author

How I thought this pull request would go:

How it actually went:
