Request spike use CPU and memory #916
Comments
@miekg My original approach regarding workers was to use a fixed worker count and drop or return SERVFAIL if all workers are busy - perhaps this is the time to try that approach again: spawn workers dynamically, but on reaching maxWorkersCount, drop or return SERVFAIL. /cc @rdrozhdzh |
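For concreteness, a minimal sketch of that fixed-pool idea, assuming only the miekg/dns Handler types; the pool size n stands in for the proposed maxWorkersCount knob, and this is a sketch of the proposal, not the library's actual dispatch code:

```go
// Package workerpool sketches the fixed-worker idea proposed above.
package workerpool

import "github.com/miekg/dns"

// job carries one incoming request to a worker.
type job struct {
	w dns.ResponseWriter
	r *dns.Msg
}

// Start launches a fixed number of workers draining a bounded queue;
// n plays the role of the proposed maxWorkersCount knob.
func Start(n int, h dns.Handler) chan job {
	jobs := make(chan job, n)
	for i := 0; i < n; i++ {
		go func() {
			for j := range jobs {
				h.ServeDNS(j.w, j.r)
			}
		}()
	}
	return jobs
}

// Dispatch enqueues a request, or answers SERVFAIL when every worker
// is busy, instead of spawning more goroutines.
func Dispatch(jobs chan job, w dns.ResponseWriter, r *dns.Msg) {
	select {
	case jobs <- job{w, r}:
	default:
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeServerFailure)
		w.WriteMsg(m)
	}
}
```

The non-blocking select is the whole trade-off under discussion: memory and goroutine count stay bounded, at the cost of shedding load once the queue fills.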
I’m as strongly opposed to a fixed worker count and dropping responses as I was originally. It creates a magic tuning variable that must be perfectly chosen to use the server to its full potential. As it stands the long-lived worker approach is purely an optimisation. I’m not at all convinced it would solve those problems in any case. It’s not clear to me where or why this problem is occurring. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>I’m as strongly opposed to a fixed worker count and dropping responses as I was originally. It creates a magic tuning variable that must be perfectly chosen to use the server to its full potential. As it stands the long-lived worker approach is purely an optimisation.
I agree with that.
>I’m not at all convinced it would solve those problems in any case. It’s not clear to me where or why this problem is occurring.
That's indeed not clear. While going over the code I couldn't directly make sense of the comment mentioned in the issue and what the code actually does. I.e. do we really switch from workers to per-conn goroutines when the server is too busy?
For reasons I also don't understand, it is also hard to replicate the exact issue.
/Miek
|
I see the value in having auto-tuned configuration settings, but being able to override them with a hard limit is something that is sorely needed in CoreDNS, because auto-tuning doesn't always work (as in this case). The requests are returned with SERVFAIL anyway, so there would be no harm in doing so (for our use case, at least). We are working on a pprof. It may be difficult to get because this condition effectively hangs the box until CoreDNS is stopped. |
Looking at the latest comments over at coredns, I'm inclined to revert the worker PR. |
@WinstonPrivacy having some magic parameters deep in this library is not a good way forward; once you start adding those knobs there is no way back. |
@miekg I agree totally. If there isn’t any small change that would make the worker code functional, I think reverting it is the only acceptable choice. |
As I recall, one of the reasons we merged the worker code in the first place was because under high load we were seeing crashes. Has that changed?
|
Reverting to spawning a goroutine for each request will just make the behaviour worse. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>As I recall one of the reasons we merged the worker code in the first place was because under high load we were seeing crashes. Has that changed?
No, this was purely billed as a performance feature.
net/http's Server does goroutine-per-request as well; if nothing else, that code should be copied/studied.
98a1ef4 doesn't cleanly revert (of course), so it will take some poking to make that work.
|
Sorry, I am late in this thread: I did not identify that it was the same topic as coredns/coredns#2593 on another repo (but the right one). IMO, we need more data to propose a direction. I want now to check what happens when we provide a huge burst (from 0 to completely overwhelmed) - which is the initial problem related in the CoreDNS issue. I would also like to see the effect in 3 cases:
And in these 2 use cases:
The tests are in progress with @rajansandeep. NOTE: our tests are limited to k8s though .... In the end, if we want to have it proper for production in k8s, that will most likely need some "magic" numbers. As not everyone has the same usage and the same environment, I guess those magic numbers should just be adjustable. |
The original workers proposal fixed the number of workers, and so it stopped a crash that was happening under heavy load. By the time it merged, it looks like we just started spawning more workers/goroutines when overwhelmed, and so all that was left was the performance gain.
I don't care about workers one way or the other, but I don't think we should make a big change that doesn't fix the crashing. I don't see how reverting the workers is going to fix that.
|
A fixed worker count seems like a really bad idea. One of my public resolver servers is currently running 1200 goroutines and is very stable. You can check the snapshot metrics at https://snapshot.raintank.io/dashboard/snapshot/tDwYNnw0icrBCeTcO4gwsF29kMJD5e1h |
@semihalev: your metrics show you handle 1200 goroutines, but also 750 MiB. |
@miekg: I understand you already merged the revert of the worker mechanism, and now we have a direct spawn of a goroutine to call serve(...) for either UDP or TCP incoming msgs. We still have to figure out whether we want to be able to limit the spike of memory or CPU when there is a burst of incoming queries. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>@miekg : I understand you already merged the revert of the worker mechanism, and we are now with a direct call of goroutines to serve either UDP or TCP incoming msgs.
Yes, and we're going to clean this up further. That previous code had been working for 8+ years.
>We still have to figure out if we want to be able to limit the spike of memory or CPU when there is a burst of incoming queries?
>I mean this issue is still open .. correct?
Yes, I'm not saying we shouldn't do anything. But there is a whole slew of things that need to be properly looked into; for one, I have no idea anymore whether it's CPU or memory that people are screaming about.
And as far as I'm aware, no script or automated way has been posted to reproduce anything. No pprofs have been generated, so it's not clear where this memory is allocated and used. Same for the CPU usage.
/Miek
|
When something like this happens, you *roll back*. This is that rollback.
|
[testing on packet w/ coredns and miekg/dns version 1.1.5] Just forwarding queries leads to 100-200 goroutines and 45 MB RSS memory. Hitting a coredns with whoami only gives about the same numbers. Forwarding does 20K qps and hitting the server directly is 40K qps. CPU goes up to 2-3 CPU s/s, which correlates nicely to how many CPUs this little box has. I see nothing worrisome whatsoever in these figures. |
@fturib most of the memory spike is from caches. |
The memory @semihalev quotes is probably the virtual memory; hopefully you'll read the memory FAQ for Go: https://golang.org/doc/faq#Why_does_my_Go_process_use_so_much_virtual_memory The actual memory is probably in the low 40s. |
@semihalev, @miekg: I agree, I wrote too quickly; 1.2k goroutines is not so much trouble for memory. We can see here that we stay under 70 MB with that amount of goroutines for serving msgs. |
I will try to re-explain what we see in these tests of CoreDNS:
With no ability to throttle the number of goroutines, the application (I mean here CoreDNS) has no way to ensure its memory stays under a limit, and the crash happens by OOM. The global problem is not really for DNS but for the application above: what strategy to use to ensure that you can continue to serve DNS queries whatever the incoming query pressure? IMO, it comes down in the DNS repo to this question: what is the option of
Unless you think there should be another strategy for the application? |
Can someone try this approach on the CoreDNS side? You have to find the right spot, but you’d want it to be the first handler that is set directly on the |
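To make the suggestion concrete, a rough sketch of what such a first handler might look like; the wrapper names here are hypothetical, and the miekg/dns Handler interface is the only real API assumed:

```go
// Package limit sketches a first handler that caps in-flight requests.
package limit

import "github.com/miekg/dns"

// Limiter wraps the next handler and fails fast once more than
// cap(sem) requests are being served concurrently.
type Limiter struct {
	Next dns.Handler
	sem  chan struct{}
}

// New builds a Limiter allowing at most maxInflight concurrent requests.
func New(next dns.Handler, maxInflight int) *Limiter {
	return &Limiter{Next: next, sem: make(chan struct{}, maxInflight)}
}

func (l *Limiter) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	select {
	case l.sem <- struct{}{}: // a slot is free
		defer func() { <-l.sem }()
		l.Next.ServeDNS(w, r)
	default: // saturated: shed load instead of queueing goroutines
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeServerFailure)
		w.WriteMsg(m)
	}
}
```

Because this lives entirely on the application side, the library keeps its goroutine-per-request model and no tuning knob has to be baked into miekg/dns itself.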
@WinstonPrivacy what are you testing (version/config/etc)? How are you testing? And note your initial bug was a cascading failure and being stuck in a high-CPU state. I'm sorry, but htop screenshots don't give a lot of detail to work with. Also, you're asking a binary to do something; sure, that will increase CPU and mem. @tmthrgd thanks. I can test that; BIND9 has a similar knob for (I think) recursive requests only. |
@miekg - I was responding to your earlier comment about not being sure whether it was CPU or memory that is affected. In our case, it is both.
This htop screenshot was taken under the same error condition as originally reported. CoreDNS frequently becomes unresponsive when running in our resource-constrained environment. This occurs when another process temporarily consumes resources. For instance, we can reliably trigger it now when either building a Go program or running a database query, but really anything can set it off. It has occurred three times on my local board today, which is handling a modest amount of internet traffic.
I'm running on the same environment that @willwp has previously provided details on.
|
@miekg here are the config/version details: ARM64 dual core 1GHz CPU, 1 GB RAM.
The problem occurs as @WinstonPrivacy mentioned:
In the first attached Prometheus screenshot I reproduced the problem at 5pm: goroutines spike to > 2000, memstats_sys_bytes climbs, and SERVFAIL is returned from queries until 7:45pm. The other resource-hungry process that kicked off the high CPU/SERVFAIL query responses had only run for 5-10 minutes. The next day I reproduced the problem at 9pm, and the goroutines spike at the end of the graph. The next graph shows process CPU seconds and coredns request count. I'm doubting the accuracy of the process_cpu_seconds measurement because htop indicated CPU was pegged at 80-100% the entire time. Please note we were not able to fetch results from the CoreDNS pprof plugin during the time CPU spiked and queries returned SERVFAIL. |
@miekg I want to add interesting results from when I disabled the worker pool in CoreDNS:
I disabled the worker pool as proposed by @tmthrgd in his [comment](coredns/coredns#2593 (comment)). Improvements noticed:
- CoreDNS recovered from > 80% CPU after 2-5 minutes, after the other resource-hungry process was done using CPU and memory resources.
- CoreDNS was no longer stuck in SERVFAIL, and within 2-5 minutes started resolving DNS queries.
Before @tmthrgd's change it would take CoreDNS > 2 hours for CPU to reduce, and for DNS queries to return **NOERROR**, after the catalyzing heavy build or database query was stopped. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
Good news! Thanks. Can you compile a later coredns, maybe from master?
I think it's easy enough to overwhelm a forwarding proxy with backed-up requests, as you can never catch up.
What could work here is to drop requests that are older than 6 seconds or so, because the client has given up on them. I have to think a bit about whether this is a) a smart thing to do, b) doable (in the code).
In the above scenario you still observe a bunch of goroutines, I assume?
|
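A minimal sketch of what that could look like as a wrapping handler, assuming the age can only be measured from handler entry (a real fix would stamp the packet as it comes off the socket); the 6-second figure is just the number floated above:

```go
// Package stale sketches dropping replies to queries the client has
// almost certainly given up on.
package stale

import (
	"time"

	"github.com/miekg/dns"
)

// maxAge is the cutoff floated above: after ~6 seconds the client has
// given up, so writing an answer is wasted work.
const maxAge = 6 * time.Second

// Handler stamps each request on entry and suppresses late replies.
type Handler struct {
	Next dns.Handler
}

func (h Handler) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	// Ideally the timestamp would be taken when the packet is read off
	// the socket; a wrapping handler can only see it this late.
	h.Next.ServeDNS(&staleWriter{ResponseWriter: w, received: time.Now()}, r)
}

type staleWriter struct {
	dns.ResponseWriter
	received time.Time
}

// WriteMsg drops the response if the request is older than maxAge.
func (s *staleWriter) WriteMsg(m *dns.Msg) error {
	if time.Since(s.received) > maxAge {
		return nil
	}
	return s.ResponseWriter.WriteMsg(m)
}
```

Note this only saves the final write; the handler chain still burns CPU on the stale query, which is part of why the "smart thing to do" question above is open.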
I was going to suggest this as well... a timeout on queries would indeed be useful in our case, because clients typically just give up and try again in a few seconds.
|
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>I was going to suggest this as well... a timeout on queries would indeed be useful in our case because clients typically just give up and try again in a few seconds.
Ack, that makes sense. There are timeout(s) in forward, but no "drop it after 6 seconds".
See coredns/coredns#2663. This probably makes sense for *all* queries coming in; after 6 seconds no one is interested in them anymore.
|
[ Quoting notifications@github.com in "Re: [miekg/dns] Request spike use C..." ]
I'll go ahead and compile from the master of coredns, reproduce, and measure.
I didn't have the Prometheus plugin/server running in the disabled-worker-pool case. I'll make sure to have it running when I repro, and share the results. |
It is not what we get here, whatever release dns 1.1.5 or 1.1.6 (we tested with both) - which is in line with the problem of goroutines we identified. I will try the proposition of @tmthrgd here, hoping that will help to see if the goroutine problem occurs only when used by CoreDNS (maybe these routines are stuck in a part of CoreDNS, holding some memory). NOTE: the pprof outputs are available on these tests, but I am not sure I can interpret them properly, as the memory reported in the pprof is only 10% of the one reported by Prometheus. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>Just forwarding queries leads to 100-200 goroutines and 45 MB RSS memory. Hitting a coredns with whoami only gives about the same numbers. Forwarding does 20K qps and hitting the server directly is 40K qps.
It is not what [we get here](coredns/coredns#2593 (comment)), whatever release dns 1.1.5 or 1.1.6 (we tested with both) - which is in line with the problem of goroutines we identified.
**However we are in a Kubernetes environment** and that maybe makes the difference.
I've said this time and time again: inside k8s is crazy town, because you can't easily test or reproduce (and I've raised repeated issues on creating a way to make this easier: coredns/coredns#2575 is the latest).
I'm still not sure what's being measured inside Kubernetes, RSS or virtual memory.
@WinstonPrivacy actually has a use case that's easily tested and already saw an improvement. Furthermore, I'm pretty sure fixing coredns/coredns#2663 will make that better (I can't think of another reason for queries backing up). Forward might be made more efficient.
I think *this* issue is solved *here* and can be closed. Either new issues proposing new approaches can be opened, or efficiency wins are always welcome.
/Miek
|
I’d like to see if my suggestion of rate limiting on the CoreDNS side can solve this; if not, I’ve got a potential minimal change to this repo that would allow rate limiting to be added externally without being invasive. |
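For a sense of what externally added rate limiting could look like, a sketch that needs nothing beyond the existing Handler interface, using golang.org/x/time/rate; the names and the REFUSED rejection policy are choices made here for illustration, not anything proposed in the thread:

```go
// Package ratelimit sketches externally added query rate limiting.
package ratelimit

import (
	"github.com/miekg/dns"
	"golang.org/x/time/rate"
)

// Limiter rejects queries beyond a configured sustained rate.
type Limiter struct {
	Next dns.Handler
	lim  *rate.Limiter
}

// New allows qps sustained queries per second with the given burst.
func New(next dns.Handler, qps float64, burst int) *Limiter {
	return &Limiter{Next: next, lim: rate.NewLimiter(rate.Limit(qps), burst)}
}

func (l *Limiter) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	if !l.lim.Allow() {
		// REFUSED is one possible policy; SERVFAIL or a silent drop
		// would work just as well.
		m := new(dns.Msg)
		m.SetRcode(r, dns.RcodeRefused)
		w.WriteMsg(m)
		return
	}
	l.Next.ServeDNS(w, r)
}
```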
@tmthrgd: I am working on it. Will let you know. If it works, I agree that we can maybe just add it as a CoreDNS plugin. |
[ Quoting <notifications@github.com> in "Re: [miekg/dns] Request spike use C..." ]
>@tmthrgd : I am working on it. Will let you know.
(I didn't test this on Sunday.)
>If it works, I agree that we can maybe just add it as a CoreDNS plugin.
I think you're already too late. In general, rate limiting should be the last thing you do.
Without a root cause, this seems to be jumping the gun.
|
We'd be willing to try out rate limiting here. The current implementation is not robust when starved for resources.
…On Wed, Mar 13, 2019 at 1:15 PM Miek Gieben wrote:
Closed #916.
|
These are two issues that are interesting:
Both are not reproduced in a minimal setting, and I failed to reproduce them.
It looks like if you restrain the CPU a process has, it can get behind and (in some cases) never recover. One way this can show up in this library is too many worker goroutines.
I took a quick look at the code in server.go; a comment at the top of the file looked a bit ominous:
esp. the last paragraph.
As said, I'm not sure how best to repro this (just sit on queries for 10 seconds?), but if we can, it would help to debug the above issues and would be a good (unit) test in general.
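If it helps, a sketch of the "just sit on queries" repro; the port, delay, and load tool are arbitrary choices here, not anything specified in the issue:

```go
// A toy repro: a server whose handler just sits on every query, so a
// load generator piles up in-flight goroutines on the server.
package main

import (
	"time"

	"github.com/miekg/dns"
)

func slow(w dns.ResponseWriter, r *dns.Msg) {
	time.Sleep(10 * time.Second) // simulate a stuck or starved handler
	m := new(dns.Msg)
	m.SetReply(r)
	w.WriteMsg(m)
}

func main() {
	dns.HandleFunc(".", slow)
	// Fire queries at :8053 (e.g. with dnsperf) and watch goroutine and
	// memory growth while every request is held open.
	srv := &dns.Server{Addr: ":8053", Net: "udp"}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```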