Avoid ngx.shared.dict:get_keys #82
Conversation
I hate to push back given the amount of work that went into preparing a PR, but are you sure this will have any meaningful performance impact? With per-worker counters we no longer expect dictionary keys to be incremented in the context of request handling. Counter sync functions are executed asynchronously using ngx.timer, and even if they have to wait for a bit for the dictionary to become unlocked, I don't believe that will impact request latency.

When possible, I would like to avoid additional complexity, keeping the code understandable and testable. Implementing concurrency primitives (like locks) correctly is hard, and concurrent systems are notoriously difficult to reason about and test. Testing concurrent behaviour of

I believe we'll need the following for me to be comfortable merging this:

I know it's a lot to ask, and I am sorry if it feels that the work you've already done is not appreciated. I hope my position here makes sense. Please let me know how you would like to proceed.
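For context, the per-worker counter approach mentioned above works roughly like this (a minimal sketch of the idea, not the actual lua-resty-counter implementation; the shared dictionary name is an assumption):

-- Increments are buffered in a plain Lua table local to the worker; a
-- repeating timer flushes the accumulated deltas into the shared
-- dictionary asynchronously, outside of request handling.
local increments = {}  -- per-worker buffer: key -> pending delta

local function incr(key, delta)
  increments[key] = (increments[key] or 0) + delta
end

local function flush(premature)
  if premature then return end
  local dict = ngx.shared.prometheus_metrics  -- assumed dict name
  for key, delta in pairs(increments) do
    dict:incr(key, delta, 0)  -- third argument initialises missing keys to 0
    increments[key] = nil
  end
end

-- Flush roughly once per second; request handlers never touch the dict.
ngx.timer.every(1, flush)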
ping @fffonion
I totally understand your points. This is a big and ugly patch and it definitely deserves some serious evidence. So here is a little self-contained benchmark: https://gist.github.com/dolik-rce/1b1a6d844fe51654d9bd8589db85bcff. Just create a benchmark/ directory in the root of the repository and copy both files into it. When you run it, it will detect whether it is in the original version or in my branch and modify the configuration accordingly. It's a very artificial benchmark, but I tried to keep it minimal to make sure I measure the right thing. Here are some numbers from my machine, so you don't have to run it before reading further:

All times are in seconds and averaged over 100 or 10 calls (metric_data is really slow, so I called it less often), joined together from two separate runs, first in master, second in my branch. Now to comment on these numbers:

Concerning the second point: I agree that more tests are definitely needed and that they will not be easy to write. I wanted to discuss this proposal first and gather some opinions from you and other interested users of this library, before investing even more time in this. It would suck to spend a few days writing tests only to be told that it all has to be rewritten because I chose the wrong API or something 🙂
Thanks for the quick response, @dolik-rce! I think what we need to measure is not the latency of metric collection itself, but the latency of other requests that get processed by nginx while metric collection is happening. I might be missing something, but as I understand

I will try to run some experiments later this week, but if you get a chance to measure impact of
You might be right... I did measure the impact on other requests in the past, but it was long before the latest performance updates. I will definitely try to benchmark that as well.
Sorry this took so long, but I wanted to make sure I understand the benchmark results correctly. You were right that the situation is much better with the resty counter, since the dict is not accessed that often. But I have found a couple of scenarios where the long

Problem 1: If

Problem 2: If

Problem 3: I believe there are some "gotchas" related to the async way nginx handles requests and the way the lua module works. It matches what I've seen in the benchmark, I'm just not sure if my explanation is correct, so take the following with a grain of salt. Consider this flow of events:

The result would be that (from the client's point of view) request A would take at least as long as the

Possible solutions

Problem 1 should be solved by this PR, because the blocking

I do not know a complete solution for problems 2 and 3, but making

Problem 2 might also be partially mitigated by spacing the counter syncs across time. Currently all the timers are set right away in

Actual numbers

Here is the benchmarking code I used: https://gist.github.com/dolik-rce/58adb967288206aec7c0065dc8c8ed17. It tests three versions of the code (pre-optimization, current master and my proposal) with three distinct cases:
Longest request
There is definitely something wrong with my version; it should not take 12s when there is no metrics collection. But on the other hand, it behaves better for serial metrics and it's the only version that survived the parallel test :-) Both are because calls to

Average performance
Here we can see that the optimizations done in the last release really give about a 40% boost in speed for regular requests, when metrics collection is not considered. Interestingly, when we start bothering the server with long metrics requests, the performance of the older code is actually better. This branch performs about the same without metrics, but keeps the performance for serial metrics as well. When run with parallel metrics collection, the performance goes down rapidly, but at least it kept working. This is probably just thanks to
Thanks for sharing the results, @dolik-rce! Your learnings and the problems you describe match my understanding of how the Lua module in nginx works. Please give me a few days to run the benchmark myself and compare results, since I'd like to run it with a smaller number of metrics. 200K time series is definitely a lot: in Prometheus 1.8 that was the number of active time series supported by the storage system with default settings (across all targets!), and I think you would typically expect a single target to return no more than a few thousand time series, not hundreds of thousands. So while it's interesting to observe what happens when the number of time series gets so high that metric collection starts consuming a comparable amount of CPU time to the actual service provided by nginx (which is at the core of the problems you describe), I don't think it's a use case we should focus on while measuring performance.
I've been thinking about this while reviewing #75, and it's the reason I moved

It would be nice to just stagger sync intervals automatically, however ngx.sleep is not available in the
I'm well aware that hundreds of thousands of lines is way too much. I have blown it out of proportion to make the times easier to measure. The biggest export in my company that I know of is about 10k, produced by MySQL Server Exporter. For comparison, it returns those 10k metrics in about 270ms, so this lua module is only about 3x slower for the same amount of metrics. That is pretty good, considering that it uses Go, which should be faster and more optimized, since it is a compiled language.
It is possible, but it's just not very pretty... You could use a busy loop, or call
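Another hypothetical option that avoids sleeping entirely (a sketch, not code from this PR): give each worker's first sync timer a different initial delay and then reschedule at the regular interval.

local SYNC_INTERVAL = 1  -- seconds, assumed value

local function sync_loop(premature)
  if premature then return end
  -- the actual per-worker counter sync would happen here
  ngx.timer.at(SYNC_INTERVAL, sync_loop)
end

local function start_staggered_sync()
  -- Offset the first run by a fraction of the interval based on the
  -- worker id, so syncs from different workers do not line up.
  local offset = ngx.worker.id() * (SYNC_INTERVAL / ngx.worker.count())
  ngx.timer.at(offset, sync_loop)
end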
I got some benchmark results that we can discuss. I have adjusted your benchmark configuration in the following way:
I ran the test on a new "CPU-Optimized / 8 GB / 4 vCPUs" VM at Digital Ocean, raw logs are here. For latency assessment I am mostly looking at the

Some observations:
I think the tail latency improvements make this change worth pursuing. I will take a closer look at the code later today or tomorrow and will leave some comments. Thanks for your patience, @dolik-rce!
Thank you for investing all that work into the benchmark and graphs! It is much more informative than my simple tables.
Yes, that is definitely a bug, I'll try to look into it.
key_index.lua
-- This must happen atomically, otherwise there could be a race condition
-- and other workers might create the same records at the same time
-- with different values.
if self.lock:wait() then
I think instead of spinning to get a lock, workers can just spin trying to get a new unused index key. This should make locking unnecessary, I believe.
What are your thoughts on this?
for _, key in pairs(keys) do
  while (true) do
    local N = self:sync()
    if self.index[key] ~= nil then break end
    N = N + 1
    local ok, err = self.dict:safe_add(KEY_PREFIX .. N, key)
    if ok then
      self.dict:incr("__key_count", 1, 0)
      self.keys[N] = key
      self.index[key] = N
      break
    elseif err ~= "exists" then
      return "unexpected error adding a key: " .. err
    end
  end
end
There is a small problem with this. Consider this (very weird, but possible) scenario:
- worker A calls sync, which returns N=10
- worker B calls sync, which returns N=10
- worker A creates new metric (N+1=11)
- worker A removes last metric (index[N=11] = nil)
- worker B successfully calls safe_add with N=11
Now worker B thinks that the new metric has index 11, while worker A thinks it is deleted and could later create a new metric with the same name but a different index.
It is actually somewhat similar to what @fffonion wrote below.
Agreed, if the same worker creates and immediately removes the metric, other workers might be able to re-use the same index. Can we make this impossible by setting the index key to an empty string instead of nil when a metric is deleted?
I'd really like to find a solution that does not involve our own lock implementation.
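Something along these lines, perhaps (a rough sketch of the tombstone idea, using the names from the snippets in this thread rather than the actual PR code):

-- Deletion overwrites the slot with an empty string instead of freeing
-- it, so the numeric index can never be handed out again.
function KeyIndex:remove(key)
  local n = self.index[key]
  if n then
    self.index[key] = nil
    self.keys[n] = nil
    -- Keep the slot occupied so other workers cannot re-use index n.
    self.dict:set(KEY_PREFIX .. n, "")
  end
end
-- Workers syncing the index would then skip empty slots instead of
-- treating a missing key as "never existed".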
I had the very same idea, just didn't have time to try it out yet. I would prefer a lock-less solution as well.
Implemented in cb9fbf2. After reading your suggestion a few more times and more carefully, I believe it should actually work as is, since the deletions are now properly handled by sync, which is called before each attempt.

At first I didn't really like the fact that the sync is in the innermost loop, but it actually makes sense. It turns out to work kind of like optimistic locking, which is perfectly fine since we can assume that collisions will happen only very rarely.
I also left some comments/suggestions, hope that helps : )
key_index.lua
local N = self:sync()
for _, key in pairs(keys) do
  -- Skip keys which already exist in this index or in the shared dict.
  if self.index[key] == nil and self.dict:get(key) == nil then
consider this case:
- worker A deletes a metric X, then self.index[X] on worker A becomes nil
- it's rare, but it could happen that worker B re-adds the same metric again; since self.index[X] on worker B is not nil, this metric is not tracked in the __key_N in the shdict

similarly on https://github.com/knyar/nginx-lua-prometheus/pull/82/files#diff-2a83fa3bc391b941b9bb7a56df5fad61R30, since we don't sync the parts less than __key_count
Good observation. It seems to me that the only reliable way to avoid this is to always do a full sync, which might result in worse performance. There might also be some more elaborate way to signal other workers to remove deleted items from their index, but that would noticeably increase complexity. Tough choice...
I have attempted to mitigate this problem in c4024e7. In the end I have chosen a compromise solution: if some worker deletes anything, other workers do a full sync, but in the normal mode of operation only an incremental sync is performed.
prometheus_keys.lua
function KeyIndex:sync()
  local full_sync_hint = self.dict:get(self.sync_hint_prefix .. ngx.worker.id()) or 0
  local N = self.dict:get(self.key_count) or 0
  if full_sync_hint > 0 then
There is no code currently that decreases sync_hint_prefix, so it seems that after the first key is deleted we'll be doing a full sync every time. Is that intentional?
Also, instead of keeping a separate sync hint counter per worker, I would suggest:
- a single monotonically increasing counter in the shared dictionary. After any deletion, the worker just increments this counter.
- a local variable storing the last sync hint value a worker previously observed.
Something like this in sync():
local sync_hint = self.dict:get(self.sync_hint_key) or 0
local N = self.dict:get(self.key_count) or 0
if sync_hint > self.sync_hint then
  self:sync_range(0, N)
  self.sync_hint = sync_hint
elseif N ~= self.last then
  self:sync_range(self.last, N)
end
return N
And in remove_by_index:
self.dict:incr(self.sync_hint_key, 1, 0)
self.sync_hint = self.sync_hint + 1
You're right, I forgot to add the decrement. But your solution is actually much better, I'll implement that. Thanks!
Done in dc0c60f.
prometheus_keys.lua
  end
end

-- Sets timer to sync the index every interval seconds.
Why is this necessary given that we trigger a sync anyway in add and list?
The idea was to put an upper bound on the time the local key_index might be out of sync. But you're probably right that it doesn't matter, since it would only be important when we collect metrics, and there we call sync anyway.
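Conceptually it was just a repeating timer capping the staleness of the per-worker index, something like this rough sketch (not the exact code in the PR):

local function schedule_periodic_sync(index, interval)
  ngx.timer.every(interval, function(premature)
    if not premature then
      index:sync()
    end
  end)
end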
Removed in 570964b.
I think we have resolved all of the issues. If you guys don't have any further ideas on how to improve this, I will move on to writing some more rigorous tests to make sure the

Thanks! I don't have any other comments.
  -- Sync only new keys, if there are any.
  self:sync_range(self.last, N)
end
return N
Just noticed that self.last does not get updated here, which means every sync is a full sync.
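Presumably the incremental branch needs something along these lines (a simplified sketch that ignores the full-sync hint):

function KeyIndex:sync()
  local N = self.dict:get(self.key_count) or 0
  if N ~= self.last then
    -- Sync only new keys, if there are any.
    self:sync_range(self.last, N)
    self.last = N  -- without this, every sync degenerates into a full sync
  end
  return N
end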
Fixed in 84dd82e
@@ -627,14 +648,14 @@ function Prometheus:metric_data()
   -- Force a manual sync of counter local state (mostly to make tests work).
   self._counter:sync()

-  local keys = self.dict:get_keys(0)
+  local keys = self.key_index:list()
   -- Prometheus server expects buckets of a histogram to appear in increasing
   -- numerical order of their label values.
   table.sort(keys)
While running the benchmark, this occasionally produces an error:
2020/05/02 16:52:00 [error] 10#10: *417 lua entry thread aborted: runtime error: attempt to compare nil with string
stack traceback:
coroutine 0:
[C]: in function 'sort'
/nginx-lua-prometheus/prometheus.lua:654: in function 'metric_data'
/nginx-lua-prometheus/prometheus.lua:700: in function 'collect'
content_by_lua(nginx.conf:71):2: in function <content_by_lua(nginx.conf:71):1>, client: 172.17.0.1, server: benchmark, request: "GET /metrics HTTP/1.1", host: "localhost:18003"
Which means:
- we need to remove nil values from the list of keys before sorting it (e.g. as sketched below);
- we need to understand how nil values appeared in the list of keys during the benchmark (which only adds metrics, never removes them).
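For the first point, a defensive fix could look roughly like this (a sketch, not necessarily what was committed): copy the non-nil values into a dense array before sorting, so table.sort always receives a proper sequence.

local keys = self.key_index:list()
local sorted_keys = {}
for _, key in pairs(keys) do
  -- pairs() only visits slots that actually exist, so the copy is dense.
  sorted_keys[#sorted_keys + 1] = key
end
table.sort(sorted_keys)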
This is weird, but I managed to catch this even in the integration tests. The oddest thing is that I printed the contents of the table and there is no nil in it. Also, the integration tests still pass even though there are some 500 responses; that should probably be fixed too.
Oh, now I see... There is a missing index, and table.sort only works on contiguous tables. Now just to find out how it happened when there is no delete.
The problem was that the list function returned a reference to self.keys. This table was then sorted in metric_data, messing up all the indices. Stupid error.
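So the fix is presumably to have list() hand out a dense copy instead of the internal table (a simplified sketch; the real function would also trigger a sync first):

function KeyIndex:list()
  -- Return a copy so callers such as metric_data() can sort the result
  -- without corrupting the worker's internal index.
  local copy = {}
  local n = 0
  for _, key in pairs(self.keys) do
    n = n + 1
    copy[n] = key
  end
  return copy
end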
I know you are still working on this, @dolik-rce, but I ran the benchmark test yesterday against the latest commit (b8c6837) and the results are here. It seems that the tail latency impact is now very similar to 0.20200420.
Thanks for the new benchmark data @knyar. I have been looking into the problem with high latency. It turns out it is mostly a benchmark artifact. Only the very first request in each worker is slow, and it is caused by the fact that the entire

I have updated the benchmark code to initialize the metrics in all workers and it brings the maximum latency back to a few milliseconds.
@knyar: I have added some tests for the
Thanks! I think it looks good. Have you done any performance testing after the most recent changes to see whether there is still any improvement compared to 0.20200420?
I have run the benchmark today, you can look at the results here. The percentiles are a little different since I have a different version of wrk and I was too lazy to tweak the default script. Overall, the difference in QPS is very noticeable for a high number of metrics. As for the percentiles, the performance is about the same as the code from master. I must confess that I still don't really understand what causes those huge drops in latency somewhere between 50k and 200k metrics. The only weird thing I noticed in the logs is that for some of those tests the access logs suggest that the benchmarking didn't really last the full minute, but only a few seconds. However, the wrk logs say it ran for a whole minute. Could it be caused by all workers being blocked by
Yeah, I suspect
Recently published optimizations forced me to revise my private patches of this wonderful library. And since I had to rewrite it from scratch, I have decided to offer the code to the general public.

These changes provide a way to avoid ngx.shared.dict:get_keys() by keeping track of all the keys as they are added or deleted. Each worker keeps its own list of keys, which is synced on an as-needed basis. There is still some locking involved, but it only stops the current worker, and only when creating a new metric/label at the same time as some other worker, which should be quite rare. We have already discussed this approach in the past in #54.

Please note that I still consider myself to be a Lua newbie... I will be glad to learn from any mistakes and/or stylistic errors you might point out in the implementation 😃