
Avoid ngx.shared.dict:get_keys #82

Merged · 21 commits merged into knyar:master from dolik-rce:avoid-get_keys on May 13, 2020

Conversation

dolik-rce
Contributor

Recently published optimizations forced me to revise my private patches of this wonderful library. And since I had to rewrite it from scratch, I have decided to offer the code to the general public.

These changes provide a way to avoid ngx.shared.dict:get_keys() by keeping track of all the keys as they are added or deleted. Each worker keeps its own list of keys, which is synced on an as-needed basis. There is still some locking involved, but it only stops the current worker, and only when creating a new metric/label at the same time as some other worker, which should be quite rare. We have already discussed this approach in the past in #54.
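Roughly, the idea looks like this (a simplified sketch with hypothetical names, not the actual key_index.lua code): every key gets a numbered slot in the shared dict, and each worker incrementally syncs only the slots it has not seen yet.

```lua
-- Simplified sketch of the key index idea (hypothetical names, not the real
-- key_index.lua). KEY_COUNT holds the highest slot number used so far.
local KEY_COUNT = "__key_count"

-- Register a key: claim the next free slot with safe_add, which fails with
-- "exists" if another worker took that slot first.
local function add_key(dict, index, keys, key)
  if index[key] then return index[key] end
  local n = dict:get(KEY_COUNT) or 0
  while true do
    n = n + 1
    local ok, err = dict:safe_add("__key_" .. n, key)
    if ok then
      dict:incr(KEY_COUNT, 1, 0)
      keys[n] = key
      index[key] = n
      return n
    elseif err ~= "exists" then
      return nil, err
    end
    -- "exists": another worker claimed slot n, retry with the next slot.
    -- (The real code also re-syncs here so it notices when that other worker
    -- registered the very same key.)
  end
end

-- Pull in slots added by other workers since the last sync; no global lock,
-- and no full scan of the dictionary.
local function sync(dict, index, keys, last)
  local n = dict:get(KEY_COUNT) or 0
  for i = last + 1, n do
    local key = dict:get("__key_" .. i)
    if key then
      keys[i] = key
      index[key] = i
    end
  end
  return n
end
```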

Please note that I still consider myself to be a Lua newbie... I will be glad to learn from any mistakes and/or stylistic errors you might point out in the implementation 😃

@coveralls

coveralls commented Apr 21, 2020

Coverage Status

Coverage increased (+0.6%) to 93.888% when pulling 5985978 on dolik-rce:avoid-get_keys into 3890f1e on knyar:master.

@knyar
Owner

knyar commented Apr 22, 2020

I hate to push back given the amount of work that went into preparing a PR, but are you sure this will have any meaningful performance impact?

With per-worker counters we no longer expect dictionary keys to be incremented in the context of request handling. Counter sync functions are executed asynchronously using ngx.timer, and even if they have to wait for a bit for the dictionary to become unlocked, I don't believe that will impact request latency.

When possible, I would like to avoid additional complexity, keeping the code understandable and testable. Implementing concurrency primitives (like locks) correctly is hard, and concurrent systems are notoriously difficult to reason about and test. Testing concurrent behaviour of lua-nginx-module is even harder, since it's a mix between lua runtime and native nginx code (written in C). The easiest way to test is probably to write a bunch of test code in nginx config and use an external process to run it and check output (similar to the existing integration test).

I believe we'll need the following for me to be comfortable merging this:

  1. Some measure of performance difference this change will make. My personal opinion is that this needs to have at least 25% performance improvement to justify additional complexity (folks in optimize for performance #75 reported 5x improvement, which made it clearly worth it), but I am happy to discuss the specific threshold.
  2. Tests validating correct behaviour of Lock and KeyIndex, verifying that operations work correctly if executed concurrently, counters report correct values, and old metrics don't resurface after del and reset.

I know it's a lot to ask, and I am sorry if it feels that the work you've already done is not appreciated. I hope my position here makes sense. Please let me know how you would like to proceed.

@hbagdi

hbagdi commented Apr 22, 2020

ping @fffonion

@dolik-rce
Contributor Author

I totally understand your points. This is a big and ugly patch and it definitely deserves some serious evidence.

So here is a little self-contained benchmark: https://gist.github.com/dolik-rce/1b1a6d844fe51654d9bd8589db85bcff. Just create benchmark/ directory in the root of the repository and copy both files in it. When you run it, it will detect if it is in the original version, or in my branch and modify the configuration accordingly.

It's a very artificial benchmark, but I tried to keep it minimal to make sure I measure the right thing. Here are some numbers from my machine, so you don't have to run it before reading further:

| Metric count | dict:get_keys | metric_data (using dict:get_keys) | key_index:get | metric_data (using key_index:get) |
|---|---|---|---|---|
| 0 | 0.00001 | 0.00000 | 0.00002 | 0.00000 |
| 10000 | 0.00000 | 0.74350 | 0.00000 | 0.37570 |
| 20000 | 0.01428 | 1.67360 | 0.00000 | 0.84800 |
| 30000 | 0.12555 | 2.70250 | 0.00000 | 1.33760 |
| 40000 | 0.22434 | 4.39140 | 0.00000 | 1.62960 |
| 50000 | 0.45670 | 6.41070 | 0.00000 | 1.93050 |

All times are in seconds, averaged over 100 calls (or 10 for metric_data, which is really slow, so I called it fewer times), and joined together from two separate runs: first in master, second in my branch.

Now to comment on these numbers:

  • As you can see, the speed of data collection improved 2-3 times.
  • The timings of key_index:get were not even measurable. When I use a higher get_keys_samples, it turns out to be about 30-60 microseconds, so the speed-up here would be about 5-6 orders of magnitude.
  • Both points above are nice, but they're pretty pointless, considering that data collection doesn't happen all that often. The important thing here is that dict:get_keys blocks all workers. I use this library on servers that serve over 3K req/s, so blocking all workers for 100 ms every time we collect metrics is noticeable... On the other hand, the proposed solution does not block other workers on collect, and might only block workers that try to add a new metric at the same time.

Concerning the second point: I agree that more tests are definitely needed and that they will not be easy to write. I wanted to discuss this proposal first and gather some opinions from you and other interested users of this library before investing even more time in this. It would suck to spend a few days writing tests only to be told that it all has to be rewritten because I chose the wrong API or something 🙂

@knyar
Owner

knyar commented Apr 22, 2020

Thanks for the quick response, @dolik-rce! I think what we need to measure is not the latency of metric collection itself, but the latency of other requests that get processed by nginx while metric collection is happening.

I might be missing something, but as I understand it, dict:get_keys should not simply block all workers. It will block other dictionary operations from other workers, all of which (for counter metrics, of course) should now be happening outside of request context. So ideally this should not introduce delays to other requests.

I will try to run some experiments later this week, but if you get a chance to measure impact of dict:get_keys on other requests, I'd be very interested to see the results.

@dolik-rce
Contributor Author

You might be right... I did measure the impact on other requests in the past, but it was long before the latest performance updates. I will definitely try to benchmark that as well.

@dolik-rce
Contributor Author

Sorry this took so long, but I wanted to make sure I understand the benchmark results correctly. You were right that the situation is much better with the resty counter, since the dict is not accessed that often. But I have found a couple of scenarios where the long get_keys call can still cause trouble.

Problem 1:

If get_keys blocks for longer than sync_interval, then all workers will be blocked. Even if it is less than sync_interval, there is still a chance that some of the workers will have to wait to access the dict.

Problem 2:

If a collect call is N times longer than the interval between scrapes, then just scraping metrics will keep N workers busy all the time. Depending on the number of metrics and workers, things can get as far as the whole application serving nothing but metrics. I used 250k metrics and 4 workers in the benchmark, and with a 10 s interval between scrapes I was able to bring the application down, not serving any other requests. The worst part is that even if you set up a correct timeout on the client, it doesn't end the lua code already executing on the server. So even if the client times out after 5 s, the collect method can easily keep the worker busy for another 20 s.

Problem 3:

I believe there are some "gotchas" related to the asynchronous way nginx handles requests and the way the lua module works. It matches what I've seen in the benchmark; I'm just not sure if my explanation is correct, so take the following with a grain of salt. Consider this flow of events:

  • regular request (let's call it A) is accepted
  • /metrics request (let's call it B) is accepted by the same worker
  • A is processed in some of the early stages
  • B gets into the content phase
  • the collect function takes a couple of seconds, but since it is a single block of lua code, it blocks the worker and never yields until it has generated the whole content
  • the rest of the phases are processed for A
  • the rest of the phases are processed for B

The result would be that (from the client's point of view) request A would take at least as long as the collect call. For some latency-critical apps, this might be way too long.

Possible solutions

Problem 1 should be solved by this PR, because the blocking get_keys would no longer be used.

I do not know a complete solution for problems 2 and 3, but making collect faster (or using fewer metrics) will make them less of a problem.

Problem 2 might also be partially mitigated by spacing the counter syncs across time. Currently all the timers are set right away in init_worker, so they probably run rather "coherently", executing at similar times. If the timer was offset by (ngx.worker.id() * sync_interval / ngx.worker.count()), it would minimize the chance that all the workers are blocked because their sync happened at the same time as the get_keys call in collect.
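To make that concrete, here is a rough sketch (not the library's actual code; sync_interval and sync_counters are placeholder names) of how the offset could be applied with a one-shot timer:

```lua
-- Sketch: offset the start of each worker's periodic sync so that not all
-- workers hit the shared dict at the same moment. sync_interval and
-- sync_counters() are assumed names, not part of this PR.
local delay = ngx.worker.id() * sync_interval / ngx.worker.count()
local ok, err = ngx.timer.at(delay, function(premature)
  if premature then return end
  -- only start the recurring sync after the per-worker offset has passed
  ngx.timer.every(sync_interval, sync_counters)
end)
if not ok then
  ngx.log(ngx.ERR, "failed to schedule staggered sync: ", err)
end
```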

Actual numbers

Here is the benchmarking code I used: https://gist.github.com/dolik-rce/58adb967288206aec7c0065dc8c8ed17. It tests three versions of the code (pre-optimization, current master and my proposal) with three distinct cases:

  • none: Just regular requests, no metrics are collected
  • serial: Metrics are collected, but each call waits until the previous call has finished
  • parallel: Metrics are collected every 10 s, with a 5 s timeout. (This is not truly parallel, but since the lua code on the server doesn't respect the timeout, it actually runs in parallel)

Longest request

| Version | No metrics | Serial metrics | Parallel metrics |
|---|---|---|---|
| 1e13e22 | 0.205 s | 53.106 s | Failed |
| 0.20200420 | 0.024 s | 60.004 s | Failed |
| avoid-get_keys | 12.255 s | 12.622 s | 59.691 s |

There is definitely something wrong with my version; it should not take 12 s when there is no metrics collection. On the other hand, it behaves better for serial metrics and it's the only version that survived the parallel test :-) Both are because calls to collect are faster. The other two versions failed because a regular request timed out after 120 s.

Average performance

| Version | No metrics | Serial metrics | Parallel metrics |
|---|---|---|---|
| 1e13e22 | 2263.71 req/s | 1882.43 req/s | Failed |
| 0.20200420 | 3150.92 req/s | 989.43 req/s | Failed |
| avoid-get_keys | 3330.97 req/s | 3699.68 req/s | 598.39 req/s |

Here we can see that the optimizations done in the last release really give about a 40% boost in speed for regular requests when metrics collection is not considered. Interestingly, when we start bothering the server with long metrics requests, the performance of the older code is actually better. This branch performs about the same without metrics, but keeps the performance for serial metrics as well. When run with parallel metrics collection, the performance drops rapidly, but at least it kept working. This is probably just thanks to collect being somewhat faster, as described above in the Possible solutions paragraph.

@knyar
Owner

knyar commented Apr 27, 2020

Thanks for sharing the results, @dolik-rce! Your learnings and the problems you describe match my understanding of how the Lua module in nginx works.

Please give me a few days to run the benchmark myself and compare results, since I'd like to run it with a smaller number of metrics. 200K time series is definitely a lot: in Prometheus 1.8 that was the number of active time series supported by the storage system with default settings (across all targets!), and I think you would typically expect a single target to return no more than a few thousand time series, not hundreds of thousands. So while it's interesting to observe what happens when the number of time series gets so high that metric collection starts consuming a comparable amount of CPU time to the actual service provided by nginx (which is at the core of the problems you describe), I don't think it's a use case we should focus on while measuring performance.

Problem 2 might be also partially mitigated by spacing the counter syncs across time.

I've been thinking about this while reviewing #75, and it's the reason I moved sync_interval to init_worker() in f3d6f2e.

It would be nice to just stagger sync intervals automatically, however ngx.sleep is not available in the init_worker_by_lua context, so it's not possible to introduce a slight delay before initializing the per-worker counter. We could decouple initializing the counter from scheduling the counter sync (and use ngx.timer.at to schedule the sync with a delay), but this will require making changes to the resty_counter lib. Which, again, means more complexity for a performance difference that I'd like to quantify first.

@dolik-rce
Contributor Author

I'm well aware that hundreds of thousands of time series is way too much. I have blown it out of proportion to make the times easier to measure. The biggest export in my company that I know of is about 10k, produced by the MySQL Server Exporter. For comparison, it returns those 10k metrics in about 270 ms, so this lua module is only about 3x slower for the same amount of metrics. That is pretty good, considering that the exporter uses Go, which should be faster and more optimized, since it is a compiled language.

Problem 2 might be also partially mitigated by spacing the counter syncs across time.

I've been thinking about this while reviewing #75, and it's the reason I moved sync_interval to init_worker() in f3d6f2e.

It would be nice to just stagger sync intervals automatically, however ngx.sleep is not available in the init_worker_by_lua context, so it's not possible to introduce a slight delay before initializing the per-worker counter.

It is possible, it's just not very pretty... You could use a busy loop, or call os.execute("sleep " .. delay), or one of the many suggestions available on the Lua wiki. Sure, it is ugly, but it would get the job done.
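Something along these lines, as an admittedly ugly sketch for the init_worker_by_lua phase (sync_interval is an assumed configuration value):

```lua
-- ngx.sleep is not available in init_worker_by_lua, so block the worker the
-- crude way before setting up its sync timer. Tolerable only at startup.
local delay = ngx.worker.id() * sync_interval / ngx.worker.count()
os.execute("sleep " .. delay)
-- ...then initialize the per-worker counter / schedule the sync as usual...
```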

@knyar
Owner

knyar commented Apr 29, 2020

I got some benchmark results that we can discuss. I have adjusted your benchmark configuration in the following way:

  • wrk used instead of ab, since it allows extracting higher percentile latency values.
  • instead of having 4 clients send 100K requests (which was your ab configuration), I had 10 clients send requests continuously for 1 minute.
  • test executed several times with increasing number of time series (1000 to 1000000).
  • parallel metric collection interval decreased from 10 seconds to 1 second.

I ran the test on a new "CPU-Optimized / 8 GB / 4 vCPUs" VM at DigitalOcean; raw logs are here. For latency assessment I am mostly looking at the parallel mode (more representative of how Prometheus is used in practice), with the 99.99th percentile as the tail measure. Note that all graphs use a logarithmic scale.

Some observations:

  • @fffonion's changes in optimize for performance #75 massively improved median request latency (which increased throughput), however they actually increased latency at the tail.
  • for the range of 5K-100K time series this change seems to significantly improve tail latency, while also slightly increasing median latency (by about 10%; I guess this could also be a measurement error).
  • in parallel mode metric collection starts timing out around 30K time series, at which point throughput and median latency start degrading.
  • after about 100K time series we clearly start spending too many resources on metric collection and results become very noisy.
  • these results confirm the extremely high tail latency of avoid-get_keys for high time series counts (30K and higher) that you have observed as well. This seems like a bug.

I think the tail latency improvements make this change worth pursuing. I will take a closer look at the code later today or tomorrow and will leave some comments. Thanks for your patience, @dolik-rce!

@dolik-rce
Contributor Author

Thank you for investing all that work into the benchmark and graphs! It is much more informative than my simple tables.

these results confirm the extremely high tail latency of avoid-get_keys for high time series counts (30K and higher) that you have observed as well. This seems like a bug.

Yes, that is definitely a bug, I'll try to look into it.

key_index.lua Outdated
-- This must happen atomically, otherwise there could be a race condition
-- and other workers might create the same records at the same time
-- with different values.
if self.lock:wait() then
Owner

I think instead of spinning to get a lock, workers can just spin trying to get a new unused index key. This should make locking unnecessary, I believe.

What are your thoughts on this?

for _, key in pairs(keys) do
  while (true) do
    local N = self:sync()
    if self.index[key] ~= nil then break end

    N = N+1
    local ok, err = self.dict:safe_add(KEY_PREFIX .. N, key)
    if ok then
      self.dict:incr("__key_count", 1, 0)
      self.keys[N] = key
      self.index[key] = N
      break
    elseif err ~= "exists" then
      return "unexpected error adding a key: " .. err
    end
  end
end

Contributor Author (@dolik-rce, Apr 30, 2020)

There is a small problem with this. Consider this (very weird, but possible) scenario:

  • worker A calls sync, which returns N=10
  • worker B calls sync, which returns N=10
  • worker A creates new metric (N+1=11)
  • worker A removes last metric (index[N=11] = nil)
  • worker B successfully calls safe_add with N=11

Now worker B thinks that the new metric has index 11, while worker A thinks it is deleted and could later create a new metric with the same name but different index.

It is actually somewhat similar to what @fffonion wrote below.

Owner

Agreed, if the same worker creates and immediately removes the metric, other workers might be able to re-use the same index. Can we make this impossible by setting the index key to an empty string instead of nil when a metric is deleted?

I'd really like to find a solution that does not involve our own lock implementation.
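Something like this rough sketch of the tombstone idea (hypothetical method name, not committed code):

```lua
-- On delete, overwrite the slot with an empty string instead of freeing it,
-- so its number can never be reused for a different key.
function KeyIndex:remove(key)
  local n = self.index[key]
  if n then
    self.index[key] = nil
    self.keys[n] = nil
    self.dict:set(KEY_PREFIX .. n, "")  -- keep the slot occupied as a tombstone
  end
end
-- sync() would then skip slots whose value is "" when rebuilding the index.
```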

Contributor Author

I had the very same idea, I just didn't have time to try it out yet. I would prefer a lock-less solution as well.

Contributor Author

Implemented in cb9fbf2. After reading your suggestion a few more times and more carefully, I believe it should actually work as is, since deletions are now properly handled by sync, which is called before each attempt.

At first I didn't really like the fact that the sync is in the innermost loop, but it actually makes sense. It turns out to work kind of like optimistic locking, which is perfectly fine since we can assume that collisions will happen only very rarely.

@fffonion (Contributor) left a comment

I also left some comments/suggestions, hope that helps : )

key_index.lua Outdated
local N = self:sync()
for _, key in pairs(keys) do
-- Skip keys which already exist in this index or in the shared dict.
if self.index[key]==nil and self.dict:get(key) == nil then
Contributor

consider this case:

Contributor Author

Good observation. It seems to me that the only reliable way to avoid this is to always do a full sync, which might result in worse performance. There might also be some more elaborate way to signal other workers to remove deleted items from their index, but that would noticeably increase complexity. Tough choice...

Contributor Author

I have attempted to mitigate this problem in c4024e7. In the end I have chosen a compromise solution: if some thread deletes anything, other threads do a full sync, but in the normal mode of operation only an incremental sync is performed.

function KeyIndex:sync()
local full_sync_hint = self.dict:get(self.sync_hint_prefix .. ngx.worker.id()) or 0
local N = self.dict:get(self.key_count) or 0
if full_sync_hint > 0 then
Owner

There is no code currently that decreases sync_hint_prefix, so it seems that after the first key is deleted we'll be doing a full sync every time. Is that intentional?

Also, instead of keeping a separate sync hint counter per worker, I would suggest:

  • a single monotonically increasing counter in the shared dictionary. After any deletion, the worker just increments this counter.
  • a local variable storing the last sync hint value a worker previously observed.

Something like this in sync():

local sync_hint = self.dict:get(self.sync_hint_key) or 0
local N = self.dict:get(self.key_count) or 0
if sync_hint > self.sync_hint then
  self:sync_range(0, N)
  self.sync_hint = sync_hint
elseif N ~= self.last then
  self:sync_range(self.last, N)
end
return N

And in remove_by_index:

self.dict:incr(self.sync_hint_key, 1, 0)
self.sync_hint = self.sync_hint + 1

Contributor Author

You're right, I forgot to add the decrement. But your solution is actually much better, I'll implement that. Thanks!

Contributor Author

Done in dc0c60f.

end
end

-- Sets timer to sync the index every interval seconds.
Owner

Why is this necessary given that we trigger a sync anyway in add and list?

Contributor Author

The idea was to put an upper bound on the time the local key_index might be out of sync. But you're probably right that it doesn't matter, since it would only be important when we collect metrics, and there we call sync anyway.

Contributor Author

Removed in 570964b.

@dolik-rce
Contributor Author

I think we have resolved all of the issues. If you guys don't have any further ideas on how to improve this, I will move on to writing some more rigorous tests to make sure the KeyIndex class works correctly.

@knyar
Owner

knyar commented May 2, 2020

I think we have resolved all of the issues. If you guys don't have any further ideas on how to improve this, I will move on to writing some more rigorous tests to make sure the KeyIndex class works correctly.

Thanks! I don't have any other comments.
I will try to run another round of benchmark tests later today or tomorrow.

-- Sync only new keys, if there are any.
self:sync_range(self.last, N)
end
return N
Owner

Just noticed that self.last does not get updated here, which means every sync is a full sync.

Contributor Author

Fixed in 84dd82e

@@ -627,14 +648,14 @@ function Prometheus:metric_data()
-- Force a manual sync of counter local state (mostly to make tests work).
self._counter:sync()

- local keys = self.dict:get_keys(0)
+ local keys = self.key_index:list()
-- Prometheus server expects buckets of a histogram to appear in increasing
-- numerical order of their label values.
table.sort(keys)
Owner

While running the benchmark, this occasionally produces an error:

2020/05/02 16:52:00 [error] 10#10: *417 lua entry thread aborted: runtime error: attempt to compare nil with string
stack traceback:
coroutine 0:
        [C]: in function 'sort'
        /nginx-lua-prometheus/prometheus.lua:654: in function 'metric_data'
        /nginx-lua-prometheus/prometheus.lua:700: in function 'collect'
        content_by_lua(nginx.conf:71):2: in function <content_by_lua(nginx.conf:71):1>, client: 172.17.0.1, server: benchmark, request: "GET /metrics HTTP/1.1", host: "localhost:18003"

Which means:

  1. we need to remove nil values from the list of keys before sorting it;
  2. we need to understand how nil values appeared in the list of keys during the benchmark (which only adds metrics and never removes them).

Contributor Author

This is weird, but I managed to catch this even in the integration tests. The oddest thing is that I printed the contents of the table and there is no nil in it. Also, the integration tests still pass even though there are some 500 responses; that should probably be fixed too.

Contributor Author

Oh, now I see... There is a missing index, and table.sort only works on contiguous tables. Now I just need to find out how it happened when there is no delete.
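For illustration only (not library code), a hole in the array part of a table is enough to reproduce the error above:

```lua
local t = { "b", nil, "a" }  -- index 2 is a hole
-- The # operator is unreliable for tables with holes; if it reports 3,
-- table.sort compares nil with a string and raises
-- "attempt to compare nil with string".
table.sort(t)
```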

Contributor Author

The problem was that the list function returned a reference to self.keys. This table was then sorted in metric_data, messing up all the indices. Stupid error ☹️ Should be fixed in 9e4d40a by returning a copy.
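A minimal sketch of that fix (field names assumed from the snippets above): list() hands out a copy, so callers such as metric_data can sort it without reordering self.keys and corrupting the index.

```lua
function KeyIndex:list()
  local copy = {}
  local n = 0
  for _, key in pairs(self.keys) do  -- pairs also skips holes left by deletions
    n = n + 1
    copy[n] = key
  end
  return copy
end
```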

@knyar
Owner

knyar commented May 4, 2020

I know you are still working on this, @dolik-rce, but I ran the benchmark test yesterday against the latest commit (b8c6837) and the results are here. It seems that the tail latency impact is now very similar to 0.20200420.

@dolik-rce
Contributor Author

Thanks for the new benchmark data @knyar.

I have been looking into the problem with high latency. It turns out it is mostly a benchmark artifact. Only the very first request in each worker is slow, and it is caused by the fact that the entire key_index must be synced, which can take a couple of seconds. This should not be a problem in regular usage, since time series creation will be distributed across all workers and also over the lifetime of the application.

I have updated the benchmark code to initialize the metrics in all workers, and it brings the maximum latency back to a few milliseconds.

@dolik-rce
Contributor Author

@knyar: I have added some tests for the KeyIndex, can you please take a look at them? Everything is nice and green now.

@knyar
Owner

knyar commented May 11, 2020

Thanks! I think it looks good.

Have you done any performance testing after the most recent changes to see whether there is still any improvement compared to 0.20200420?

@dolik-rce
Contributor Author

I have run the benchmark today; you can look at the results here. The percentiles are a little different since I have a different version of wrk and I was too lazy to tweak the default script. Overall, the difference in QPS is very noticeable for a high number of metrics. As for the percentiles, the performance is about the same as the code from master.

I must confess that I still don't really understand what causes those huge drops in latency somewhere between 50k and 200k metrics. The only weird thing I noticed is that for some of those tests the access logs suggest that the benchmarking didn't really last a full minute, but only a few seconds. However, the wrk log says it ran for a whole minute. Could it be caused by all workers being blocked by /metrics requests and wrk only counting finished requests?

@knyar merged commit 9ec4a62 into knyar:master on May 13, 2020
@knyar
Owner

knyar commented May 13, 2020

Could it be caused by all workers being blocked by /metrics requests and wrk only counting finished requests?

Yeah, I suspect wrk might not actually track the latency of requests that timed out.

@dolik-rce deleted the avoid-get_keys branch on May 14, 2020, 05:55