
Request spike use CPU and memory #916

Closed

miekg opened this issue Feb 28, 2019 · 43 comments

miekg (Owner) commented Feb 28, 2019

These are two issues that are interesting:

Neither is reproduced in a minimal setting, and I have failed to reproduce them.

It looks like if you restrict the CPU a process has, it can fall behind and (in some cases) never recover. One way this can show up in this library is too many worker goroutines.

I took a quick look at the code in server.go, and this comment at the top of the file looked a bit ominous:

// The maximum number of idle workers.
//
// This controls the maximum number of workers that are allowed to stay
// idle waiting for incoming requests before being torn down.
//
// If this limit is reached, the server will just keep spawning new
// workers (goroutines) for each incoming request. In this case, each
// worker will only be used for a single request.
const maxIdleWorkersCount = 10000

especially the last paragraph.

As said, I'm not sure how best to repro this (just sit on queries for 10 seconds?), but if we can, it would help debug the above issues and would make a good (unit) test in general.
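One rough way to repro along those lines, sketched below purely as an illustration (the address, timings, and query count are made up, not anything used in this issue): a handler that sits on every query for ten seconds, plus a burst of concurrent clients, so the growth in goroutines can be watched directly.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Handler that "sits on" every query, simulating a slow or stalled backend.
	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		time.Sleep(10 * time.Second)
		m := new(dns.Msg)
		m.SetReply(r)
		w.WriteMsg(m)
	})

	srv := &dns.Server{Addr: "127.0.0.1:8053", Net: "udp"}
	go srv.ListenAndServe()
	time.Sleep(500 * time.Millisecond) // crude wait for the listener to come up

	// Fire a burst of concurrent queries at the slow server.
	var wg sync.WaitGroup
	for i := 0; i < 5000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m := new(dns.Msg)
			m.SetQuestion("example.org.", dns.TypeA)
			c := &dns.Client{Timeout: 2 * time.Second}
			c.Exchange(m, "127.0.0.1:8053") // errors/timeouts are expected and ignored here
		}()
	}
	fmt.Println("goroutines during burst:", runtime.NumGoroutine())
	wg.Wait()
	fmt.Println("goroutines after burst:", runtime.NumGoroutine())
}

Watching runtime.NumGoroutine() (or the pprof goroutine profile) before, during, and after the burst gives a crude picture of whether the server sheds the extra workers once the pressure stops.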

miekg (Owner, Author) commented Feb 28, 2019

/cc @UladzimirTrehubenka

UladzimirTrehubenka (Contributor) commented:

@miekg My original approach regarding workers was to use a fixed worker count and drop or return SERVFAIL if all workers are busy. Maybe it is time to try this approach again: dynamically spawn workers, but once maxWorkersCount is reached, drop the query or return SERVFAIL.

/cc @rdrozhdzh
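For illustration only (this is not the library's worker code, and maxWorkers is a hypothetical knob): the cap described above could be sketched as a wrapper around a dns.Handler that answers SERVFAIL once too many queries are already in flight.

package limit

import "github.com/miekg/dns"

const maxWorkers = 10000 // hypothetical hard limit on in-flight queries

// Capped wraps next so that at most maxWorkers queries are served concurrently;
// any query beyond that gets an immediate SERVFAIL instead of queueing up.
func Capped(next dns.Handler) dns.Handler {
	sem := make(chan struct{}, maxWorkers)
	return dns.HandlerFunc(func(w dns.ResponseWriter, r *dns.Msg) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeDNS(w, r)
		default:
			m := new(dns.Msg)
			m.SetRcode(r, dns.RcodeServerFailure)
			w.WriteMsg(m)
		}
	})
}

The trade-off, discussed below, is that maxWorkers becomes a magic tuning value.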

tmthrgd (Collaborator) commented Feb 28, 2019

I’m as strongly opposed to a fixed worker count and dropping responses as I was originally. It creates a magic tuning variable that must be perfectly chosen to use the server to its full potential. As it stands the long-lived worker approach is purely an optimisation.

I’m not at all convinced it would solve those problems in any case. It’s not clear to me where or why this problem is occurring.

miekg (Owner, Author) commented Mar 1, 2019 via email

WinstonPrivacy commented:

> @miekg My original approach regarding workers was to use a fixed worker count and drop or return SERVFAIL if all workers are busy. Maybe it is time to try this approach again: dynamically spawn workers, but once maxWorkersCount is reached, drop the query or return SERVFAIL.

I see the value in having auto-tuned configuration settings but being able to override them with a hard limit is something that is sorely needed in CoreDNS because auto-tuning doesn't always work (as in this case).

The requests are returned with SERVFAIL anyway, so there would be no harm in doing so (for our use case, at least).

We are working on a pprof. It may be difficult to do because this condition effectively hangs the box until CoreDNS is stopped.

miekg (Owner, Author) commented Mar 7, 2019

Looking at the latest comments "over at coredns", I'm inclined to revert the worker PR.

miekg (Owner, Author) commented Mar 7, 2019

@WinstonPrivacy having some magic parameters deep in this library is not a good way forward; once you start adding those knobs there is no way back.

tmthrgd (Collaborator) commented Mar 7, 2019

@miekg I agree totally. If there isn’t any small change that would make the worker code functional, I think reverting it is the only acceptable choice.

johnbelamaric commented Mar 8, 2019 via email

tmthrgd (Collaborator) commented Mar 8, 2019

For context, the following gave us the worker code: #639, #565 & #664.

UladzimirTrehubenka (Contributor) commented:

Reverting to spawning a worker for each request will just make the behaviour worse.
Actually there are only two cases:

  • we have no limit (or not a very aggressive limit), and we will see spikes, OOM, etc.;
  • we have some resource limit (memory, workers, etc.) and simply cannot handle some requests, but have no CoreDNS crashes.

We cannot have a resource limit and the ability to handle any request count at the same time.


miekg (Owner, Author) commented Mar 8, 2019 via email

fturib (Contributor) commented Mar 8, 2019

Sorry, I am late to this thread: I did not identify that it was the same topic as coredns/coredns#2593, just on another repo (but the right one).

IMO, we need to have more data to propose a direction.

I now want to check what happens when we provide a huge burst (from 0 to completely overwhelmed), which is the initial problem reported in the CoreDNS issue.

I would also like to see the effect in 3 cases:

  • no limit on goroutines (and no workers)
  • having workers pick up the incoming queries and dropping any query when no worker is available
  • maybe allowing workers plus a capped number of goroutines to "swallow" a peak (not a steady burst).

And in these 2 use cases:

  • an immediate overwhelming burst
  • a somewhat slow ramp-up where the system can stabilize itself before being overwhelmed.

The tests are in progress with @rajansandeep

NOTE: our tests are limited to k8s though.

In the end, if we want this to be proper for production in k8s, it will most likely need some "magic" numbers. As not everyone has the same usage and the same environment, I guess those magic numbers should just be adjustable.

johnbelamaric commented Mar 9, 2019 via email

miekg (Owner, Author) commented Mar 9, 2019

#928 and #929 to revert that code.

We can then try again. These ad-hoc PRs to fix the bug du jour are not the correct way forward. As for magic constants: this was discussed at length and no acceptable justification for any number was given.

semihalev (Contributor) commented:

A fixed worker count seems like a really bad idea. One of my public resolver servers is currently running with 1200 goroutines and is very stable. You can check the snapshot metrics at https://snapshot.raintank.io/dashboard/snapshot/tDwYNnw0icrBCeTcO4gwsF29kMJD5e1h

fturib (Contributor) commented Mar 9, 2019

@semihalev: your metrics show you handle 1200 goroutines, but also 750 MiB.
I guess the whole point is here: as long as you can offer the needed amount of memory, you can have any number of goroutines (waiting to serve, with a DNS msg in a buffer) you want.

fturib (Contributor) commented Mar 9, 2019

@miekg: I understand you already merged the revert of the worker mechanism, and we now directly spawn a goroutine to call serve(...) for each incoming UDP or TCP msg.

We still have to figure out whether we want to be able to limit the spike of memory or CPU when there is a burst of incoming queries.
I mean, this issue is still open, correct?

miekg (Owner, Author) commented Mar 9, 2019 via email

miekg (Owner, Author) commented Mar 9, 2019 via email

miekg (Owner, Author) commented Mar 9, 2019

[testing on packet w/ coredns and miekg/dns version 1.1.5]

Just forwarding queries leads to 100-200 goroutines and 45 MB RSS memory. Hitting a coredns with whoami only gives about the same numbers. Forwarding does 20K qps and hitting the server directly is 40K qps.

CPU goes up to 2-3 CPU s/s, which correlates nicely with how many CPUs this little box has. I see nothing worrisome whatsoever in these figures.

semihalev (Contributor) commented:

@fturib most of the memory spike is from caches.

miekg (Owner, Author) commented Mar 9, 2019

The memory @semihalev quotes is probably virtual memory; hopefully you'll read the memory FAQ for Go: https://golang.org/doc/faq#Why_does_my_Go_process_use_so_much_virtual_memory

The actual memory is probably in the low 40s.

WinstonPrivacy commented:

For us, this is a CPU problem. Memory is never very excessive, even in the error state, IIRC. Resident memory on a pretty active system that I'm on right now is just 105 MB.

[htop screenshot]

fturib (Contributor) commented Mar 9, 2019

@semihalev, @miekg: I agree, I wrote too quickly; 1.2k goroutines is not much trouble for memory. We can see here that we stay under 70 MB with that amount of goroutines serving msgs.

fturib (Contributor) commented Mar 9, 2019

I will try to re-explain what we see in these tests of CoreDNS:

  • we limit the queries to only a few domains: the queries go only to the cache plugin, and the cache has only a few hundred entries filled.
  • the memory used is directly proportional to the number of goroutines
  • the number of goroutines can go high (30k, even 80 or 90k simultaneous routines).
  • each of these routines holds at least the incoming msg

With no ability to throttle the number of goroutines, the application (I mean here CoreDNS) has no way to ensure its memory stays under a limit, and the crash happens by OOM.
NOTE: we also see it can happen for other reasons (like the liveness probe becoming unresponsive).

The global problem is not really for DNS but for the application above: what strategy to use to ensure you can continue to serve DNS queries whatever the incoming query pressure?

IMO, for the DNS repo it comes down to this question: what option of dns.Server can I tune at setup to get a throttle on the incoming queries that is compatible with the absolute memory limit allowed for my application (CoreDNS)?

Unless you think there should be another strategy for the application?
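One possible application-side strategy, sketched here purely as an illustration (this is not an existing dns.Server or CoreDNS option; memoryBudget and the sampling interval are invented numbers): sample the heap in the background and shed queries while the process is over its budget.

package shed

import (
	"runtime"
	"sync/atomic"
	"time"

	"github.com/miekg/dns"
)

const memoryBudget = 256 << 20 // hypothetical budget: 256 MiB of heap

var overBudget atomic.Bool

func init() {
	// Sample heap usage in the background; runtime.ReadMemStats is too costly to call per query.
	go func() {
		for range time.Tick(250 * time.Millisecond) {
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms)
			overBudget.Store(ms.HeapAlloc > memoryBudget)
		}
	}()
}

// Shedding wraps next and silently drops queries while the heap is over budget.
func Shedding(next dns.Handler) dns.Handler {
	return dns.HandlerFunc(func(w dns.ResponseWriter, r *dns.Msg) {
		if overBudget.Load() {
			return // drop; the client will retry or time out
		}
		next.ServeDNS(w, r)
	})
}

Note the limitation: the goroutine and its incoming msg already exist by the time this handler runs, so it bounds the work done per query rather than the per-query goroutines themselves, which is exactly the gap being discussed here.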

WinstonPrivacy commented:

I was curious about CPU and memory usage so I reproduced the problem again just now. I was wrong about what I said earlier... CPU increases to 100% (both cores) but memory utilization more than triples. It might have gone even higher but the board ran out of memory.

[htop screenshot]

tmthrgd (Collaborator) commented Mar 9, 2019

Can someone try this approach on the CoreDNS side? You have to find the right spot, but you’d want it to be the first handler that is set directly on the Server.
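The linked approach itself isn't reproduced here, but as a rough sketch of the shape such a first handler could take (golang.org/x/time/rate and the qps parameter are assumptions, not part of this repo or of CoreDNS):

package ratelimit

import (
	"github.com/miekg/dns"
	"golang.org/x/time/rate"
)

// Limited wraps next so that queries arriving faster than roughly qps per
// second are dropped before they reach the rest of the handler chain.
func Limited(next dns.Handler, qps int) dns.Handler {
	l := rate.NewLimiter(rate.Limit(qps), qps) // burst == qps is an arbitrary choice
	return dns.HandlerFunc(func(w dns.ResponseWriter, r *dns.Msg) {
		if !l.Allow() {
			return // drop; the client is expected to retry
		}
		next.ServeDNS(w, r)
	})
}

Wiring it up would then look something like srv.Handler = Limited(mux, 10000) on the dns.Server, so the limiter runs before anything else.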

miekg (Owner, Author) commented Mar 10, 2019

@WinstonPrivacy what are you testing (version/config/etc.)? How are you testing? And note your initial bug was a cascading failure and being stuck in a high-CPU state.

I'm sorry, but htop screenshots don't give a lot of detail to work with. Also, you're asking a binary to do something; sure, that will increase CPU and memory.

@tmthrgd thanks. I can test that; BIND9 has a similar knob for (I think) recursive requests only.

WinstonPrivacy commented Mar 10, 2019 via email

willwp commented Mar 10, 2019

@miekg here are the config/version details: ARM64 dual core 1GHz CPU, 1 GB RAM.
Image: CoreDNS-1.2.5, linux/arm64, go1.11.2, with this fix merged in: coredns/coredns#2319
Corefile:

:53 {
       cache 3600
       forward . tls://1.1.1.1 tls://1.0.0.1 {
               tls_servername cloudflare-dns.com
               health_check 5s
               max_fails 1
               expire 90s
       }
       alternate SERVFAIL . tls://9.9.9.9 tls://149.112.112.112 {
                tls_servername dns.quad9.net
                health_check 5s
                max_fails 1
                expire 90s
       }
}
.:54 {
       cache 3600
       forward . tls://1.1.1.1 tls://1.0.0.1 {
               tls_servername cloudflare-dns.com
               health_check 5s
               max_fails 1
               expire 90s
        }
        alternate SERVFAIL . tls://9.9.9.9 tls://149.112.112.112 {
               tls_servername dns.quad9.net
               health_check 5s
               max_fails 1
               expire 90s
        }
}

The problem occurs as @WinstonPrivacy mentioned:

  1. Run another process that consumes resources, such as building a large Go program.
  2. Query CoreDNS at a moderate rate, e.g. queries from various social media and streaming sites.
  3. CoreDNS CPU stays at 80-100%, and memory demand increases.
  4. CoreDNS starts returning SERVFAIL for every query.
  5. The other resource-heavy process stops, e.g. the build or database queries stop.
  6. CoreDNS remains in a high CPU and memory state and returns SERVFAIL for every query; it takes 2-3 hours to start responding NOERROR to queries.

In the first attached Prometheus screenshot I reproduced the problem at 5pm: goroutines spike to > 2000, memstate_sys_bytes climbs, and SERVFAIL is returned for queries until 7:45pm. The other resource-hungry process that kicked off the high CPU/SERVFAIL responses had only run for 5-10 minutes. The next day I reproduced the problem at 9pm, and the goroutines spike at the end of the graph.
[Prometheus graph: goroutines]

The next graph shows process_cpu_seconds and the CoreDNS request count. I doubt the accuracy of the process_cpu_seconds measurement because htop indicated CPU was pegged at 80-100% the entire time.
[Prometheus graph: process_cpu_seconds_total]

Please note we were not able to fetch results from the CoreDNS pprof plugin during the time CPU spiked and queries returned SERVFAIL.

willwp commented Mar 10, 2019

@miekg I want to add some interesting results from when I disabled the worker pool in CoreDNS, as proposed by @tmthrgd in his comment.
Improvements noticed:

  • CoreDNS recovered from > 80% CPU after 2-5 minutes, once the other resource-hungry process was done using CPU and memory resources.

  • CoreDNS was no longer stuck in SERVFAIL, and within 2-5 minutes started resolving DNS queries.

Before @tmthrgd's change, it would take CoreDNS > 2 hours for CPU to drop and for DNS queries to return NOERROR after the catalyzing heavy build or database query was stopped.

miekg (Owner, Author) commented Mar 10, 2019 via email

WinstonPrivacy commented Mar 10, 2019 via email

miekg (Owner, Author) commented Mar 10, 2019 via email

willwp commented Mar 10, 2019

> Good news! Thanks. Can you compile a later coredns, maybe from master?

I'll go ahead and compile from the master of coredns, reproduce, and measure.

> In the above scenario you still observe a bunch of goroutines I assume?

I didn't have the Prometheus plugin/server running in the disabled-worker-pool case. I'll make sure to have it running when I repro, and share the results.

fturib (Contributor) commented Mar 12, 2019

> Just forwarding queries leads to 100-200 goroutines and 45 MB RSS memory. Hitting a coredns with whoami only gives about the same numbers. Forwarding does 20K qps and hitting the server directly is 40K qps.

That is not what we get here, whether with miekg/dns 1.1.5 or 1.1.6 (we tested with both), which is in line with the goroutine problem we identified.
However, we are in a Kubernetes environment and that may make the difference.

I will try @tmthrgd's proposal here, hoping that will help to see whether the goroutine problem appears only when the library is used by CoreDNS (maybe these routines are stuck in a part of CoreDNS, holding some memory).

NOTE: the pprof output is available for these tests, but I am not sure I can interpret it properly, as the memory reported by pprof is only 10% of the amount reported by Prometheus.

miekg (Owner, Author) commented Mar 12, 2019 via email

tmthrgd (Collaborator) commented Mar 12, 2019

I’d like to see if my suggestion of rate limiting on the CoreDNS side can solve this; if not, I’ve got a potential minimal change to this repo that would allow rate limiting to be added externally without being invasive.

fturib (Contributor) commented Mar 12, 2019

@tmthrgd: I am working on it. Will let you know. If it works, I agree that we can maybe just add it as a CoreDNS plugin.

miekg (Owner, Author) commented Mar 13, 2019 via email

miekg closed this as completed Mar 13, 2019
WinstonPrivacy commented Mar 26, 2019 via email
