Occasional blocking when using Provide() #453

Closed
MichaelMure opened this issue Feb 17, 2020 · 29 comments

Comments

@MichaelMure
Contributor

I have a worker node specialized in DHT providing, based on go-ipfs core with the latest DHT release (go-libp2p-kad-dht v0.5.0, go-ipfs 8e9725fd0009 (master from Feb 7)). This worker runs highly concurrent Provide calls (up to 5k concurrent tasks) to publish a large number of values in a reasonable amount of time.

While this initially works quite well, I found that some Provide calls get stuck over time, even though I pass them a context with a 5-minute timeout. To detect the problem, I set up an extra goroutine for each task that checks at 5 minutes 30s whether the Provide has returned.
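
Concretely, the per-task watchdog looks roughly like this (a minimal sketch; `provideWithWatchdog`, the logging and the exact timeouts are illustrative, not the actual worker code):

```go
import (
	"context"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	kaddht "github.com/libp2p/go-libp2p-kad-dht"
)

// provideWithWatchdog runs a single Provide with a 5-minute context and
// flags it as stuck if it has not returned 30s after that deadline.
func provideWithWatchdog(ctx context.Context, dht *kaddht.IpfsDHT, c cid.Cid) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		select {
		case <-done:
			// Provide returned in time, nothing to report.
		case <-time.After(5*time.Minute + 30*time.Second):
			log.Printf("Provide(%s) is stuck: still running past its context deadline", c)
		}
	}()

	if err := dht.Provide(ctx, c, true); err != nil {
		log.Printf("Provide(%s): %v", c, err)
	}
	close(done)
}
```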

Here is an example run:

image

As you can see, there are indeed stuck Provides. As those never return, they eventually bring down the publish rate, since they don't pick up new values to publish. Btw, don't be too concerned by the progressive ramp-up at the beginning; it's just the worker spawning new tasks gradually to avoid congestion.

To track this problem down further, I wrote some code that, once a first blocked call is detected, changes the concurrency factor to zero. The result is that all the sane Provides complete and get removed, leaving only the blocked ones. When this happens, I found that hundreds of those are left, even though the teardown starts as soon as the first one is found.

After all the sane Provides returned, I took a goroutine dump:

raw goroutine dump
deduped with goroutine-inspect (notice the duplicate count for each stacktrace)

Here is what I could infer from that:

  • 13k goroutines left
  • still a LOT of stuck tasks according to the logs
  • 2827 in bitswap
  • 1730 in DHT's handleNewMessage --> go-msgio --> read IO
  • 6 in DHT's dial queue
  • 2831 in yamux
  • 118 in bitswap's blockstoremanager
  • 2812 in another place in yamux
  • 2830 in swarm c.conn.AcceptStream()

A few points:

  • it happens more with high concurrency, but I can't tell if it's because of the concurrency itself or simply because there are more occasions to trigger the failure
  • once a first blocked call happens, it's likely that more will follow quickly. Again, this could be a cascade of events or simply because the conditions to trigger it are met.
  • despite all of that, Provide never returns an error.
  • this could simply be a problem in the code I use to detect the blocking, but that code looks correct, and the detected blockings correlate really well with the decrease in publish rate
  • that said, I couldn't find the place where Provide actually blocks. I suspect congestion or a deadlock of some sort but can't pinpoint where.
@MichaelMure
Contributor Author

Same problem as in ipfs/kubo#3657 ?

@Stebalien
Member

Could I get some stack traces when this buildup is happening? The yamux, handleNewMessage, etc. goroutines are usually sleeping goroutines waiting on streams/connections.

However, there are a couple of places where this could be happening:

  • We're not handling the context when writing to the stream. Buffering usually takes care of this but we should try to do something smarter.
  • To reduce the overhead of recreating streams, we're using one stream per peer instead of one stream per request. To manage access to these streams, we're using locks. That's likely causing us to block, waiting for other requests to finish; a rough sketch of the pattern is below. See DHT Request pipelining #92 and friends.
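
In code, that per-peer pattern looks roughly like this (a simplified illustration, not the actual go-libp2p-kad-dht message sender; `peerSender` and `maxResponseSize` are invented names):

```go
import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p-core/network"
)

const maxResponseSize = 4096 // illustrative only

// peerSender illustrates the one-stream-per-peer pattern: all requests
// to a given peer serialize on the same lock, so one slow write makes
// every other caller wait, which is where a Provide can appear to hang.
type peerSender struct {
	mu sync.Mutex     // guards the single reused stream
	s  network.Stream // one stream per peer, reused across requests
}

func (ps *peerSender) sendRequest(ctx context.Context, req []byte) ([]byte, error) {
	ps.mu.Lock() // blocks while another request to this peer is in flight
	defer ps.mu.Unlock()

	// Note: the write itself does not observe ctx, so a peer with a full
	// send window can hold the lock well past the caller's deadline.
	if _, err := ps.s.Write(req); err != nil {
		return nil, err
	}
	buf := make([]byte, maxResponseSize)
	n, err := ps.s.Read(buf)
	return buf[:n], err
}
```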

Stebalien added a commit that referenced this issue Feb 22, 2020
Related to #453 but not a fix. This will cause us to actually return early when
we start blocking on sending to some peers, but it won't really _unblock_ those
peers. For that, we need to write with a context.
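
For reference, "writing with a context" could look something like the following sketch, which derives a write deadline from the context (illustrative only; `writeWithContext` is an invented helper, not the change that eventually landed):

```go
import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/network"
)

// writeWithContext propagates the caller's deadline to the stream so a
// blocked write gives up when the context expires, instead of waiting
// indefinitely on a congested peer.
func writeWithContext(ctx context.Context, s network.Stream, msg []byte) error {
	if deadline, ok := ctx.Deadline(); ok {
		if err := s.SetWriteDeadline(deadline); err != nil {
			return err
		}
		// Clear the deadline afterwards so later writes on the reused
		// stream are not affected.
		defer s.SetWriteDeadline(time.Time{})
	}
	_, err := s.Write(msg)
	return err
}
```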
@MichaelMure
Contributor Author

A few days ago I noticed that I had a bug in my auto-scaling code that occasionally resulted in terminating tasks not being counted properly, so I entirely rewrote this part into a much cleaner and, more importantly, well-tested package to remove that variable from the equation. While doing that I also wrote better telemetry and better stuck-task detection. Here is what I found out.

  • Provide() calls do get stuck, but not in the sense that I thought. They do outlive the context significantly, but eventually return. As you can see in the following, there are no stuck tasks piling up over time, but a significant number of them outlive their context:
    image

Here is a sample distribution of how long it takes to return after the context is done:
image

Not surprisingly, using a shorter timeout (40s) significantly increases the number of stuck tasks. It looks like they just finish naturally, ignoring the context deadline. @Stebalien I see your fix/observe-context-in-message-sender branch; I'll give it a try.

  • Despite different scaling parameters and timeouts, all the workers follow a fairly similar pattern in terms of publishing speed:
    image

This pattern is the result of the auto-scaling doing its thing:
image

This scaling down keeps the CPU usage relatively constant:
image

Eventually, this means that Provide() globally, and rather suddenly, requires more CPU to do the same thing (again, the correlation in timing is surprising):
image

During all that, the time it takes to publish a single value stays quite constant:
image

For the sake of completeness, memory usage stays constant once it reaches a plateau:
image

  • Another thing I found out: the OpenCensus metrics in the DHT are misleading, at least to me. ReceivedMessages, SentMessages and SentRequests only record valid messages or requests, because of the early return on error. So, for example, graphing an error/total ratio is completely bogus (notice how it goes over 100%).
    image

To finish, here is a deduped goroutine dump of a worker under load, past the point of increased CPU usage:
goroutines-on-load-dedup.txt

... and two pprof CPU profiles: one from a recently launched worker, one from a long-running one.
profile-short-running.txt
profile-long-running.txt
--> go tool pprof --http=:8123 -nodecount=10000 /tmp/profile-long-running.txt

@aschmahmann
Contributor

aschmahmann commented Feb 24, 2020

@MichaelMure my instinct here is that it's not a coincidence that CPU usage increases as the memory plateaus. It may be possible that Go is starting to aggressively run the garbage collector in order to free up memory and that is eating up all your CPU cycles. How much total memory does your box have and how much is being used by Go (e.g. by running top)?
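
One cheap way to check that theory (a generic sketch, not go-ipfs-specific; `logGCStats` and the interval are arbitrary) is to watch runtime.MemStats.GCCPUFraction:

```go
import (
	"log"
	"runtime"
	"time"
)

// logGCStats periodically reports how much of the available CPU the
// garbage collector has consumed since the process started, plus the
// current heap size and the cumulative GC count.
func logGCStats() {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		log.Printf("gc-cpu=%.1f%% heap=%dMiB numgc=%d",
			m.GCCPUFraction*100, m.HeapAlloc>>20, m.NumGC)
	}
}
```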

@Stebalien
Member

Apparently I forgot to actually PR a fix: #462

Note: This is not a real fix. The real underlying issue is that we don't pipeline or open multiple streams.

@MichaelMure
Contributor Author

@aschmahmann I think you are likely correct, but what is actually happening still escapes me. This is an EC2 instance with 8GB of memory. As you can see, the memory usage plateaus close to that value. That said, this Datadog metric is a bit weird as it reports the used memory plus the cached memory (the memory used for disk caching, which can be freed at any time). When ssh-ing in, free -mh reports 6.1GB available, and htop shows a total usage of 3.8G/7.54G and 3.4G RES for the actual process.

Digging further, pprof reports the following:
image

So, 870MB instead of 3.4G: that's a lot of memory not accounted for. It also doesn't explain why this becomes a problem even though there is supposedly more RAM available if needed.

I guess my next move is to collect expvar metrics and see if something shows up.
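
(The standard library makes that cheap: importing expvar registers a /debug/vars handler that already includes memstats. A sketch, with a hypothetical port:)

```go
import (
	_ "expvar" // registers /debug/vars (includes memstats) on the default mux
	"log"
	"net/http"
)

func init() {
	go func() {
		// Hypothetical port; scrape http://localhost:6061/debug/vars with
		// curl or Datadog to track runtime.MemStats (heap, GC pauses, ...)
		// over time.
		log.Println(http.ListenAndServe("localhost:6061", nil))
	}()
}
```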

@MichaelMure
Contributor Author

@Stebalien FYI, it seems that #462 is doing a good job at reducing the stuck-task problem.

Before:
image

After:
image

Some are still not returning properly though (this is the duration after the context deadline):
image

@Stebalien
Member

Could I get another goroutine dump after the patch?

Also note, that patch doesn't really fix the problem, unfortunately. The real problem is that these calls are blocked until they time out.

@MichaelMure
Contributor Author

Here it is: goroutine-dedup.txt

@MichaelMure
Contributor Author

MichaelMure commented Feb 27, 2020

Sorry if this is turning into a debugging session, but looking at a heap profile, time.NewTimer stands out:
inuse_space
image
inuse_object
image

That's a lot of timers still in use. Could that be a leak? This particular worker has been running for two days, with about 3k concurrent Provides at the moment.

Edit: go tool pprof --http=:8123 -nodecount=10000 heapdump.txt - heapdump.txt
(the providerManager also stands out, but that's because I'm using an in-memory datastore for it at the moment; I'm going to change that)
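
For context, a timer buildup like this classically comes from a pattern along these lines (a generic illustration, not the actual DHT code; `consume` and its channel are made up):

```go
import "time"

// consume illustrates the classic leak: every iteration allocates a new
// timer that is never stopped, so it stays pinned in the runtime's timer
// heap until it fires, long after the select has moved on. With thousands
// of concurrent loops, these unexpired timers add up.
func consume(work <-chan func()) {
	for {
		t := time.NewTimer(5 * time.Minute)
		select {
		case job, ok := <-work:
			if !ok {
				t.Stop()
				return
			}
			job()
			// BUG: t.Stop() is missing here; the fix is to stop (or
			// reuse) the timer before looping again.
		case <-t.C:
			return // idle timeout
		}
	}
}
```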

@MichaelMure
Contributor Author

FYI, I also did a test with go-1.13.5 vs go-1.14.0, three instances of each, and the following behavior is consistent: go-1.14.0 makes things worse.

go 1.13:
image

go 1.14:
image

As the auto-scaling maintains a constant CPU usage, the task count, already lower at the start, plummets to ~1/5 of what it is with go 1.13. In turn, as the concurrency is lower, a bunch of other metrics are affected on go 1.14: fewer and shorter GC pauses, ~1/2 the heap allocation, and the same for the RES memory.

@Stebalien
Member

That's a lot of timers still in use. Could that be a leak? This particular worker has been running for two days, with about 3k concurrent Provides at the moment.

Yes. #466

Sorry if this is turning into a debugging session, but looking at a heap profile

No, this is great! Well, sort of. We're about to replace that exact code so it's kind of a non-issue but we might as well fix it.

@Stebalien
Member

Stebalien commented Feb 28, 2020

Here it is: goroutine-dedup.txt

(FYI, normal stack traces are preferred as they can be parsed with https://github.com/whyrusleeping/stackparse/)

Also, give me how long it's been stuck.

@Stebalien
Member

Stebalien commented Feb 28, 2020

What datastore are you using?

edit: Oh, you're using a map one. Try using an in-memory leveldb. Finding providers in a map datastore is very slow.

@Stebalien
Member

There are 2000 stuck dials. Did you change some dial limits somewhere?

@Stebalien
Member

I think #467 will fix the leak.

@MichaelMure
Contributor Author

Thanks for the fixes 👍

(FYI, normal stack traces are preferred as they can be parsed with https://github.com/whyrusleeping/stackparse/)
Also, give me how long it's been stuck.

Ha, that's the one I keep losing track of :)
Here is one: goroutine.zip

It's not happening often now with #462: "only" 377 times in 12h, which makes it hard to catch in a goroutine dump (and you can't automate that, as there is no way to take a stack trace of a single other goroutine as far as I know). Here is the distribution (again, this is after a 150s context timeout):
image
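
As for automation: a dump of all goroutines (the raw form of the deduped dumps above) can at least be triggered from inside the process as soon as a stuck task is detected; a sketch, with an illustrative output path:

```go
import (
	"log"
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack dump of every goroutine to a file,
// in the same text format as a manual dump, so tools like stackparse can
// consume it. The path is illustrative.
func dumpGoroutines() {
	f, err := os.Create("/tmp/goroutines-stuck.txt")
	if err != nil {
		log.Printf("goroutine dump: %v", err)
		return
	}
	defer f.Close()
	// debug=2 prints every goroutine with full frames and wait reasons.
	if err := pprof.Lookup("goroutine").WriteTo(f, 2); err != nil {
		log.Printf("goroutine dump: %v", err)
	}
}
```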

Oh, you're using a map one. Try using an in-memory leveldb. Finding providers in a map datastore is very slow.

I just changed to an on-disk leveldb store. I'll keep an eye on it but I don't think the extra latency will be a problem.

There are 2000 stuck dials. Did you change some dial limits somewhere?

Connection manager is set to 2000/3000.

@Stebalien
Member

Connection manager is set to 2000/3000.

We have a limit on concurrent file-descriptor-consuming dials inside the swarm. I may have misread something, but it looked like we had 2000 concurrent dials, which should be impossible (we limit to ~100).

But I'll look again.

@Stebalien
Member

@MichaelMure
Contributor Author

MichaelMure commented Feb 28, 2020

I have LIBP2P_SWARM_FD_LIMIT set to 5000. Not entirely sure that makes sense though.

edit: the reasoning was that we are running powerful machines in good networking conditions and are only doing DHT publishing/bitswap. It should be OK to unleash more throughput.

@Stebalien
Member

Ah, yes, that's it.

That should be fine and is probably good for your use-case. I was just confused (and a bit concerned) when I saw so many parallel TCP dials.

@MichaelMure
Contributor Author

FYI, with the fixed metrics of #464, this is what I get:
image
So a fairly low (?) error ratio for sendMessage and sendRequest, but quite a high 40% for receiveMessage (which handles both messages and requests, if I'm not mistaken). I dug a bit into that and almost 100% of those are "error reading message: stream reset" errors.

I'll let you figure out 1) if that is a sign of an actual problem and 2) if those should indeed be recorded as errors.

@Stebalien
Member

but quite a high 40% for

Thanks for highlighting that. That's normal and we shouldn't be recording that as an error. That's the error we get when the remote side resets the stream (e.g., the connection is closing, they no longer need to make requests, etc.).

@MichaelMure
Contributor Author

I just changed to an on-disk leveldb store. I'll keep an eye on it but I don't think the extra latency will be a problem.

I'm taking that back, using an on-disk datastore instead of MapDatastore seems to have reduced the average publish time by >10s.

image

@MichaelMure
Contributor Author

Err, disregard that, it's unrelated. It seems that this worker suddenly dropped >1k peers (most likely the connection manager, but it went lower than usual), resulting in a way, way better received-message error ratio.

image
image
Zoomed in:
image
So, some especially bad peer(s?) got removed, resulting in much saner routing? It'd be interesting to know how to detect and filter them.

@Stebalien
Member

Very interesting... Are you sure your ISP/router didn't just start rate-limiting your dials?

@MichaelMure
Contributor Author

It's running on AWS EC2 so that's unlikely.

@MichaelMure
Contributor Author

Not so much related to this problem (which is pretty much solved, btw), but since this has turned into a dev diary ...

I noticed an oddity: the received message/request error ratio follows a daily cycle, with the lowest point at around 9PM UTC.
image

@MichaelMure
Contributor Author

Ignoring the oddities mentioned above, the actual problems have been resolved. Closing.

Thanks for the help :)
