Occasional blocking when using Provide() #453

Closed
MichaelMure opened this issue Feb 17, 2020 · 29 comments

Comments

@MichaelMure
Contributor

I have a worker node specialized in DHT providing, based on go-ipfs core with the latest DHT release (go-libp2p-kad-dht v0.5.0, go-ipfs 8e9725fd0009 (master from Feb 7)). This worker runs highly concurrent Provide calls (up to 5k concurrent tasks) to publish a large number of values in a reasonable amount of time.

While this initially works quite well, I found that some Provide calls get stuck over time, even though I pass them a context with a 5-minute timeout. To detect the problem, I set up an extra goroutine for each task that checks at 5 minutes 30s whether the Provide has returned.
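
Concretely, the per-task watchdog looks roughly like this (a minimal sketch; `provideWithWatchdog`, the logging and the exact timeouts are illustrative, not the actual worker code):

```go
import (
	"context"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	kaddht "github.com/libp2p/go-libp2p-kad-dht"
)

// provideWithWatchdog runs a single Provide with a 5-minute context and
// flags it as stuck if it has not returned 30s after that deadline.
func provideWithWatchdog(ctx context.Context, dht *kaddht.IpfsDHT, c cid.Cid) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		select {
		case <-done:
			// Provide returned in time, nothing to report.
		case <-time.After(5*time.Minute + 30*time.Second):
			log.Printf("Provide(%s) is stuck: still running past its context deadline", c)
		}
	}()

	if err := dht.Provide(ctx, c, true); err != nil {
		log.Printf("Provide(%s): %v", c, err)
	}
	close(done)
}
```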

Here is an example run:

image

As you can see, there are indeed stuck Provides. As those never return, they eventually bring down the publish rate, since they don't pick up new values to publish. Btw, don't be too concerned by the progressive ramp-up at the beginning; it's just the worker spawning new tasks gradually to avoid congestion.

To track this problem down further, I wrote some code that, once a first blocked call is detected, changes the concurrency factor to zero. The result is that all the sane Provides complete and get removed, leaving only the blocked ones. When this happens, I found that hundreds of those are left, even though the teardown starts as soon as the first one is found.

After all the sane Provides returned, I took a goroutine dump:

raw goroutine dump
deduped with goroutine-inspect (notice the duplicate count for each stacktrace)

Here is what I could infer from that:

  • 13k goroutines left
  • still a LOT of stuck tasks according to the logs
  • 2827 in bitswap
  • 1730 in DHT's handleNewMessage --> go-msgio --> read IO
  • 6 in DHT's dial queue
  • 2831 in yamux
  • 118 in bitswap's blockstoremanager
  • 2812 in another place in yamux
  • 2830 in swarm c.conn.AcceptStream()

A few points:

  • it happens more with high concurrency, but I can't tell if it's because of the concurrency itself or simply because there are more occasions to trigger the failure
  • once a first blocked call happens, it's likely that more will follow quickly. Again, this could be a cascade of events or simply because the conditions to trigger it are met.
  • despite all of that, Provide never returns an error.
  • this could simply be a problem in the code I use to detect the blocking, but that code looks correct, and the detected blockings correlate really well with the decrease in publish rate
  • that said, I couldn't find the place where Provide actually blocks. I suspect congestion or a deadlock of some sort but can't pinpoint where.
@MichaelMure
Contributor Author

Same problem as in ipfs/kubo#3657 ?

@Stebalien
Member

Could I get some stack traces when this buildup is happening? The yamux, handleNewMessage, etc. goroutines are usually sleeping goroutines waiting on streams/connections.

However, there are a couple of places where this could be happening:

  • We're not handling the context when writing to the stream. Buffering usually takes care of this but we should try to do something smarter.
  • To reduce the overhead of recreating streams, we're using one stream per peer instead of one stream per request. To manage access to these streams, we're using locks. That's likely causing us to block, waiting for other requests to finish; a rough sketch of the pattern is below. See DHT Request pipelining #92 and friends.
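
In code, that per-peer pattern looks roughly like this (a simplified illustration, not the actual go-libp2p-kad-dht message sender; `peerSender` and `maxResponseSize` are invented names):

```go
import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p-core/network"
)

const maxResponseSize = 4096 // illustrative only

// peerSender illustrates the one-stream-per-peer pattern: all requests
// to a given peer serialize on the same lock, so one slow write makes
// every other caller wait, which is where a Provide can appear to hang.
type peerSender struct {
	mu sync.Mutex     // guards the single reused stream
	s  network.Stream // one stream per peer, reused across requests
}

func (ps *peerSender) sendRequest(ctx context.Context, req []byte) ([]byte, error) {
	ps.mu.Lock() // blocks while another request to this peer is in flight
	defer ps.mu.Unlock()

	// Note: the write itself does not observe ctx, so a peer with a full
	// send window can hold the lock well past the caller's deadline.
	if _, err := ps.s.Write(req); err != nil {
		return nil, err
	}
	buf := make([]byte, maxResponseSize)
	n, err := ps.s.Read(buf)
	return buf[:n], err
}
```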

Stebalien added a commit that referenced this issue Feb 22, 2020
Related to #453 but not a fix. This will cause us to actually return early when
we start blocking on sending to some peers, but it won't really _unblock_ those
peers. For that, we need to write with a context.
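
For reference, "writing with a context" could look something like the following sketch, which derives a write deadline from the context (illustrative only; `writeWithContext` is an invented helper, not the change that eventually landed):

```go
import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/network"
)

// writeWithContext propagates the caller's deadline to the stream so a
// blocked write gives up when the context expires, instead of waiting
// indefinitely on a congested peer.
func writeWithContext(ctx context.Context, s network.Stream, msg []byte) error {
	if deadline, ok := ctx.Deadline(); ok {
		if err := s.SetWriteDeadline(deadline); err != nil {
			return err
		}
		// Clear the deadline afterwards so later writes on the reused
		// stream are not affected.
		defer s.SetWriteDeadline(time.Time{})
	}
	_, err := s.Write(msg)
	return err
}
```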
@MichaelMure
Contributor Author

A few days ago I noticed that I had a bug in my auto-scaling code that occasionally resulted in terminating tasks not being counted properly, so I entirely rewrote this part into a much cleaner and, more importantly, well-tested package to remove that variable from the equation. While doing that I also wrote better telemetry and better stuck-task detection. Here is what I found out.

  • Provide() calls do get stuck, but not in the sense that I thought. They do outlive the context significantly, but eventually return. As you can see in the following, there are no stuck tasks piling up over time, but a significant number of them outlive their context:
    image

Here is a sample distribution of how long it takes to return after the context is done:
image

Not surprisingly, using a shorter timeout (40s) significantly increases the number of stuck tasks. It looks like they just finish naturally, ignoring the context deadline. @Stebalien I see your fix/observe-context-in-message-sender branch; I'll give it a try.

  • Despite different scaling parameters and timeouts, all the workers follow a fairly similar pattern in terms of publishing speed:
    image

This pattern is the result of the auto-scaling doing its thing:
image

This scaling down keeps the CPU usage relatively constant:
image

Eventually, this means that Provide() globally, and rather suddenly, requires more CPU to do the same thing (again, the correlation in timing is surprising):
image

During all that, the time it takes to publish a single value stays quite constant:
image

For the sake of completeness, memory usage stays constant once it reaches a plateau:
image

  • Another thing I found out: the OpenCensus metrics in the DHT are misleading, at least to me. ReceivedMessages, SentMessages and SentRequests only record valid messages or requests, because of the early return on error. So, for example, graphing an error/total ratio is completely bogus (notice how it goes over 100%).
    image

To finish, here is a deduped goroutine dump of a worker under load, past the point of increased CPU usage:
goroutines-on-load-dedup.txt

... and two pprof CPU profiles: one from a recently launched worker, one from a long-running one.
profile-short-running.txt
profile-long-running.txt
--> go tool pprof --http=:8123 -nodecount=10000 /tmp/profile-long-running.txt

@aschmahmann
Contributor

aschmahmann commented Feb 24, 2020

@MichaelMure my instinct here is that it's not a coincidence that CPU usage increases as the memory plateaus. It may be possible that Go is starting to aggressively run the garbage collector in order to free up memory and that is eating up all your CPU cycles. How much total memory does your box have and how much is being used by Go (e.g. by running top)?
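
One cheap way to check that theory (a generic sketch, not go-ipfs-specific; `logGCStats` and the interval are arbitrary) is to watch runtime.MemStats.GCCPUFraction:

```go
import (
	"log"
	"runtime"
	"time"
)

// logGCStats periodically reports how much of the available CPU the
// garbage collector has consumed since the process started, plus the
// current heap size and the cumulative GC count.
func logGCStats() {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		log.Printf("gc-cpu=%.1f%% heap=%dMiB numgc=%d",
			m.GCCPUFraction*100, m.HeapAlloc>>20, m.NumGC)
	}
}
```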

@Stebalien
Member

Apparently I forgot to actually PR a fix: #462

Note: This is not a real fix. The real underlying issue is that we don't pipeline or open multiple streams.

@MichaelMure
Contributor Author

@aschmahmann I think you are likely correct, but what is actually happening still escapes me. This is an EC2 instance with 8GB of memory. As you can see, the memory usage plateaus close to that value. That said, this Datadog metric is a bit weird as it reports the used memory plus the cached memory (the memory used for disk caching, which can be freed at any time). When ssh-ing in, free -mh reports 6.1GB available, and htop shows a total usage of 3.8G/7.54G and 3.4G RES for the actual process.

Digging further, pprof reports the following:
image

So, 870MB instead of 3.4G: that's a lot of memory not accounted for. It also doesn't explain why this becomes a problem even though there is supposedly more RAM available if needed.

I guess my next move is to collect expvar metrics and see if something shows up.
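
(The standard library makes that cheap: importing expvar registers a /debug/vars handler that already includes memstats. A sketch, with a hypothetical port:)

```go
import (
	_ "expvar" // registers /debug/vars (includes memstats) on the default mux
	"log"
	"net/http"
)

func init() {
	go func() {
		// Hypothetical port; scrape http://localhost:6061/debug/vars with
		// curl or Datadog to track runtime.MemStats (heap, GC pauses, ...)
		// over time.
		log.Println(http.ListenAndServe("localhost:6061", nil))
	}()
}
```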

@MichaelMure
Contributor Author

@Stebalien FYI, it seems that #462 is doing a good job at reducing the stuck-task problem.

Before:
image

After:
image

Some are still not returning properly though (this is the duration after the context deadline):
image

@Stebalien
Member

Could I get another goroutine dump after the patch?

Also note, that patch doesn't really fix the problem, unfortunately. The real problem is that these calls are blocked until they time out.

@MichaelMure
Contributor Author

Here it is: goroutine-dedup.txt

@MichaelMure
Contributor Author

MichaelMure commented Feb 27, 2020

Sorry if this is turning into a debugging session, but looking at a heap profile, time.NewTimer stands out:
inuse_space
image
inuse_object
image

That's a lot of timers still in use. Could that be a leak? This particular worker has been running for two days, with about 3k concurrent Provides at the moment.

Edit: go tool pprof --http=:8123 -nodecount=10000 heapdump.txt - heapdump.txt
(the providerManager also stands out, but that's because I'm using an in-memory datastore for it at the moment; I'm going to change that)
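
For context, a timer buildup like this classically comes from a pattern along these lines (a generic illustration, not the actual DHT code; `consume` and its channel are made up):

```go
import "time"

// consume illustrates the classic leak: every iteration allocates a new
// timer that is never stopped, so it stays pinned in the runtime's timer
// heap until it fires, long after the select has moved on. With thousands
// of concurrent loops, these unexpired timers add up.
func consume(work <-chan func()) {
	for {
		t := time.NewTimer(5 * time.Minute)
		select {
		case job, ok := <-work:
			if !ok {
				t.Stop()
				return
			}
			job()
			// BUG: t.Stop() is missing here; the fix is to stop (or
			// reuse) the timer before looping again.
		case <-t.C:
			return // idle timeout
		}
	}
}
```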

@MichaelMure
Contributor Author

FYI, I also did a test with go-1.13.5 vs go-1.14.0, three instances of each, and the following behavior is consistent: go-1.14.0 makes things worse.

go 1.13:
image

go 1.14:
image

As the auto-scaling maintains a constant CPU usage, the task count, already lower at the start, plummets to ~1/5 of what it is with go 1.13. In turn, as the concurrency is lower, a bunch of other metrics are affected on go 1.14: fewer and shorter GC pauses, ~1/2 the heap allocation, and the same for the RES memory.

@Stebalien
Member

That's a lot of timers still in use. Could that be a leak? This particular worker has been running for two days, with about 3k concurrent Provides at the moment.

Yes. #466

Sorry if this is turning into a debugging session, but looking at a heap profile

No, this is great! Well, sort of. We're about to replace that exact code so it's kind of a non-issue but we might as well fix it.

@Stebalien
Member

Stebalien commented Feb 28, 2020

Here it is: goroutine-dedup.txt

(FYI, normal stack traces are preferred as they can be parsed with https://github.com/whyrusleeping/stackparse/)

Also, give me how long it's been stuck.

@Stebalien
Member

Stebalien commented Feb 28, 2020

What datastore are you using?

edit: Oh, you're using a map one. Try using an in-memory leveldb. Finding providers in a map datastore is very slow.

@Stebalien
Member

There are 2000 stuck dials. Did you change some dial limits somewhere?

@Stebalien
Member

I think #467 will fix the leak.

@MichaelMure
Contributor Author

Thanks for the fixes 👍

(FYI, normal stack traces are preferred as they can be parsed with https://github.com/whyrusleeping/stackparse/)
Also, give me how long it's been stuck.

Ha, that's the one I keep losing track of :)
Here is one: goroutine.zip

It's not happening often now with #462: "only" 377 times in 12h, which makes it hard to catch in a goroutine dump (and you can't automate that, as there is no way to take a stack trace of a single other goroutine as far as I know). Here is the distribution (again, this is after a 150s context timeout):
image
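
As for automation: a dump of all goroutines (the raw form of the deduped dumps above) can at least be triggered from inside the process as soon as a stuck task is detected; a sketch, with an illustrative output path:

```go
import (
	"log"
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack dump of every goroutine to a file,
// in the same text format as a manual dump, so tools like stackparse can
// consume it. The path is illustrative.
func dumpGoroutines() {
	f, err := os.Create("/tmp/goroutines-stuck.txt")
	if err != nil {
		log.Printf("goroutine dump: %v", err)
		return
	}
	defer f.Close()
	// debug=2 prints every goroutine with full frames and wait reasons.
	if err := pprof.Lookup("goroutine").WriteTo(f, 2); err != nil {
		log.Printf("goroutine dump: %v", err)
	}
}
```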

Oh, you're using a map one. Try using an in-memory leveldb. Finding providers in a map datastore is very slow.

I just changed to an on-disk leveldb store. I'll keep an eye on it but I don't think the extra latency will be a problem.

There are 2000 stuck dials. Did you change some dial limits somewhere?

Connection manager is set to 2000/3000.

@Stebalien
Member

Connection manager is set to 2000/3000.

We have a limit on concurrent file-descriptor-consuming dials inside the swarm. I may have misread something, but it looked like we had 2000 concurrent dials, which should be impossible (we limit to ~100).

But I'll look again.

@Stebalien
Member

@MichaelMure
Contributor Author

MichaelMure commented Feb 28, 2020

I have LIBP2P_SWARM_FD_LIMIT set to 5000. Not entirely sure that makes sense though.

edit: the reasoning was that we are running powerful machines in good networking conditions and are only doing DHT publishing/bitswap. It should be OK to unleash more throughput.

@Stebalien
Member

Ah, yes, that's it.

That should be fine and is probably good for your use-case. I was just confused (and a bit concerned) when I saw so many parallel TCP dials.

@MichaelMure
Contributor Author

FYI, with the fixed metrics of #464, this is what I get:
image
So a fairly low (?) error ratio for sendMessage and sendRequest, but quite a high 40% for receiveMessage (which handles both messages and requests, if I'm not mistaken). I dug a bit into that and almost 100% of those are "error reading message: stream reset" errors.

I'll let you figure out 1) if that is a sign of an actual problem and 2) if those should indeed be recorded as errors.

@Stebalien
Member

but quite a high 40% for

Thanks for highlighting that. That's normal and we shouldn't be recording that as an error. That's the error we get when the remote side resets the stream (e.g., the connection is closing, they no longer need to make requests, etc.).

@MichaelMure
Contributor Author

I just changed to an on-disk leveldb store. I'll keep an eye on it but I don't think the extra latency will be a problem.

I'm taking that back, using an on-disk datastore instead of MapDatastore seems to have reduced the average publish time by >10s.

image

@MichaelMure
Contributor Author

Err, disregard that, it's unrelated. It seems that this worker suddenly dropped >1k peers (most likely the connection manager, but it went lower than usual), resulting in a way, way better received-message error ratio.

image
image
Zoomed in:
image
So, some especially bad peer(s?) got removed, resulting in much saner routing? It'd be interesting to know how to detect and filter them.

@Stebalien
Member

Very interesting... Are you sure your ISP/router didn't just start rate-limiting your dials?

@MichaelMure
Contributor Author

It's running on AWS EC2 so that's unlikely.

@MichaelMure
Contributor Author

Not so much related to this problem (which is pretty much solved, btw), but since this has turned into a dev diary ...

I noticed an oddity: the received message/request error ratio follows a daily cycle, with the lowest point at around 9PM UTC.
image

@MichaelMure
Contributor Author

Ignoring the oddities mentioned above, the actual problems have been resolved. Closing.

Thanks for the help :)
