-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsqd: track stats of end-to-end message processing time #280
Conversation
You're a machine @mynameisfiber |
@davemarchevsky just wait until the comments start flowing in |
so we meet again trebek |
@@ -348,13 +357,34 @@ func (c *Channel) TouchMessage(clientID int64, id nsq.MessageID) error { | |||
return nil | |||
} | |||
|
|||
func (c *Channel) processMessagePercentile() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to add another goroutine to the channel to manage this data.
There are a couple reasons...
- as written there's a race between some other goroutine that calls
Exit()
(which closesc.messagePercentilesChan
) and any other goroutine that callsFinishMessage()
(which sends onc.messagePercentilesChan
). We've handled these cases in a variety of ways, but in short, its a pain in the ass. - the overhead. I really want to minimize overhead from this on the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine... we can do it another way also that maintains the state of the last reset and just does a .Reset()
on the quantile if it's too old.
I had originally done this but opted out because if a channel hasn't had a finished message in a while, the data will be old and un-reset. There are ways around this though, by having a custom interface into the quantile object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to literally inline it in FinishMessage()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly... originally it did:
c.removeFromInFlightPQ(item)
msgStartTime := item.Value.(*inFlightMessage).msg.Timestamp
nowNano := time.Now().UnixNano()
if msgStartTime > c.lastPercentileResetTime + timeToReset {
c.messagePercentilesStream.Reset()
c.lastPercentileResetTime = nowNano
}
c.messagePercentilesStream.Insert(nowNano - msgStartTime)
return nil
by the way I laughed at the soccer in the face kid basically all day today |
hahaha, that's great. i've got a couple more gif's up my sleeve depending on how this pull request goes ;) |
This is the 25th comment and it's going well so far... I estimate we'll be in the 50ish range before we've merged. |
If @jehiah gets involved all bets are off. |
If @ploxiln gets involved I'm leaving. |
@mreiferson if this gets to 100 comments I'm leaving.... I think that encompasses either @jehiah or @ploxiln joining. |
does chiming in to say "hah don't worry about that" count as "getting involved" |
FYI, if you want to compare before/after performance you can use # setup
$ nsqd --mem-queue-size=1000000
$ bench_writer
# this creates the channel so the topic drain into the channel isn't part of the benchmark
$ curl 'http://127.0.0.1:4151/create_channel?topic=sub_bench&channel=ch'
# this is the one that is interesting
$ bench_reader |
@mreiferson some timings -- master:
this branch:
|
That being said, I've run the benchmarks several times and the values for both branches vary within the same range... I'd say that the average timings for this branch vs master are much closer than their variances. |
That's good news. Does that afford us any opportunities for smoothing over the issues relating to how we're resetting? |
can you paste the |
statsdMemStats = flag.Bool("statsd-mem-stats", true, "toggle sending memory and GC stats to statsd") | ||
statsdPrefix = flag.String("statsd-prefix", "nsq.%s", "prefix used for keys sent to statsd (%s for host replacement)") | ||
messagePercentiles = util.FloatArray{} | ||
messagePercentileResetTime = flag.Float64("message-percentile-reset-time", 600, "time in seconds between message percentile data flushes (default: 600)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a flag.Duration
(which is really convenient)
As for the smoothing, we can always roll between two quantiles and then merge them when results are requested. That would make the queries for the percentiles a bit more expensive, but since that's not per message I'm alright with that. Thoughts? |
thats still pretty unwieldy... I suppose all you really want is secs, ms, us so instead of:
we would have
and if we had a unit smaller than a second...
Maybe we roll our own? |
Benchmarks with 2 rolling quantiles:
The benchmarks are being quite variable... I think my laptop is a bit overloaded at the moment. |
this is what I get: # before
2013/10/22 17:15:26 duration: 629.3911ms - 30.305mb/s - 158883.721ops/s - 6.294us/op
# after
2013/10/22 17:12:54 duration: 700.623892ms - 27.224mb/s - 142729.931ops/s - 7.006us/op I should really write a little |
New timing format:
The last question remaining is how should we display this in the nsqadmin page? Since we are taking data from multiple nsqd's, I'm not sure what the best way to aggregate the data is. I'm thinking for each percentile we show the min/max value. Thoughts? |
statsdAddress string | ||
statsdPrefix string | ||
statsdInterval time.Duration | ||
e2eProcessingLatencyWindowTime time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline before for diff and group with var below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k
could you elaborate re: the issues with adding graphs? |
@mreiferson the main thing is just UI.... we are already inserting a lot of data so displaying it inline in the e2e cell would cramp things up. I could add another column, but then we probably go off screen. Any thoughts? |
@mreiferson and should the graph show all available percentiles on one plot? multiple sparklines? toggle between them? here is the issue. |
yea, I think for sparkline sized graphs... maybe just the aggregate textual values are better... |
} | ||
*a = append(*a, v) | ||
} | ||
sort.Sort(sort.Reverse(sort.Float64Slice(*a))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort.Reverse
doesn't work on go1.0.3
😦
we've got an epic going on over here |
@mynameisfiber I played around with the table layout, I think it's an improvement: I had to drop the sparklines, though... Lemme know what you think Pickup the commit at mreiferson@d24d773 |
For the docs:
|
This commit adds new statistics about how long individual messages take from being published to being finished. We maintain several percentile values, the exactly quantiles which can be specified as runtime parameters to nsqd. In addition, this information is displayed quite shnazily in nsqadmin.
👍 💯 🎆 |
nsqd: track stats of end-to-end message processing time
As per #268, this pull request adds the ability to track statistics about the time it takes for messages to get
.finished()
. This is done by maintaining a probabilistic percentile calculation over each channel (using the perks package) and merging them when topic quantiles are requested.This data is surfaced in:
/stats
callStill to do:
Sample data from
/stats
call: