High percentage of threads blocked on DedicatedThreadPoolPipeScheduler.RunWorkLoop #28

robertmujica · 2019-03-20T20:17:26Z

Hi,

we are having an Issue in prod where we see a high percentage / number of Threads in waiting status, and when looking at full Net call stack, it is pointing to "Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()+99
--"

We were using version 1.0.9 and I came across this article https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html where you explain an issue you fixed on 1.1.*, which was very similar to what we were seeing in PROD so i went ahead and raised a PR and upgrading this NuGet to 2.0.7 and at same time I updated StackExchange.Redis from 2.0.519 to 2.0.571

However after deploying this new version and doing more testing we are still seeing high number of blocked threads of course way less than what we saw before but still happening.

this is call stack details, our app is running on Net472. if you need further details let me know.

[[GCFrame]]

[[HelperMethodFrame_1OBJ] (System.Threading.Monitor.ObjWait)] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()+99
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+163
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+14
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+52
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+5c
[[GCFrame]]
[[DebuggerU2MCatchHandlerFrame]]
[[ContextTransitionFrame]]
[[DebuggerU2MCatchHandlerFrame]]

ntdll!NtWaitForMultipleObjects+14

KERNELBASE!WaitForMultipleObjectsEx+f9
clr!WaitForMultipleObjectsEx_SO_TOLERANT+62
clr!Thread::DoAppropriateWaitWorker+1e4
clr!Thread::DoAppropriateWait+7d
clr!CLREventBase::WaitEx+c4
clr!Thread::Block+27
clr!SyncBlock::Wait+19d
[[GCFrame]]
clr!ObjectNative::WaitTimeout+e1
[[HelperMethodFrame_1OBJ] (System.Threading.Monitor.ObjWait)] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()+99
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+163
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+14
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+52
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+5c
clr!CallDescrWorkerInternal+83
clr!CallDescrWorkerWithHandler+4e
clr!MethodDescCallSite::CallTargetWorker+f8
clr!ThreadNative::KickOffThread_Worker+109
[[GCFrame]]
clr!ManagedThreadBase_DispatchInner+39
clr!ManagedThreadBase_DispatchMiddle+6c
clr!ManagedThreadBase_DispatchOuter+75
[[DebuggerU2MCatchHandlerFrame]]
clr!ManagedThreadBase_DispatchInCorrectAD+15
clr!Thread::DoADCallBack+278
[[ContextTransitionFrame]]
clr!ManagedThreadBase_DispatchInner+2fc3
clr!ManagedThreadBase_DispatchMiddle+6c
clr!ManagedThreadBase_DispatchOuter+75
[[DebuggerU2MCatchHandlerFrame]]
clr!ManagedThreadBase_FullTransitionWithAD+2f
clr!ThreadNative::KickOffThread+db
clr!Thread::intermediateThreadProc+86
kernel32!BaseThreadInitThunk+14
ntdll!RtlUserThreadStart+21

mgravell · 2019-03-20T20:37:25Z

This does not necessarily indicate a problem. Basically, to avoid thread-pool starvation issues impact the IO cycle, we use a dedicate thread-pool that is already ready with threads to process IO via the pipelines API. If there's nothing for those threads to do at the time, they'll be sat at a Wait, waiting for a pulse indicating that there's something for them to do.

The default number of threads we spin up here is: 10

So: are you actually seeing a problem here? or are you just seeing these threads sat quietly minding their own business?

Note: this isn't 10 per multiplexer/connection/etc; they are shared.

mgravell · 2019-03-20T20:52:46Z

BTW: an option here is to pass in a specific SocketManager; by default it uses a shared SocketManager, which uses 10 threads; but an individual SocketManager only holds 5 (the shared instance assumes it may see more activity). You should not, however, create too many SocketManager instances.

robertmujica · 2019-03-20T21:17:38Z

Yes that is what it seems to be happening these threads are waiting in Monitor.Wait to re-acquire a lock which they released. And it seems a Monitor.Pulse or Monitor.PulseAll is never called and when i looked at call stack they all are pointing me to this call:

Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()

So it means at some point the server is running out of threads and start getting request errors, slow response time, etc.

Regarding your suggestion above using SocketManager, would point me to some sample code that i can look at to see if this is an option for us ?

mgravell · 2019-03-20T21:31:25Z

the pulse should be being invoked whenever at least one worker is waiting, which is the case here; I do not believe that the server is running out of threads, but I'm more than happy to investigate - but can you please share what symptoms you are seeing that makes you think this is related? there being worker threads waiting for work does not mean anything - that is expected. It looks like I don't currently expose the queue length, but I expect it is zero; I've added a tweak (see linked commit above) to help me expose this, but it'll need a new client build to show it.

robertmujica · 2019-03-20T23:31:37Z

Thanks for Quick response on this issue, will have new dump files tomorrow to do further analysis so i will share more details. But so far I share same thoughts you have that this should not have a big impact in overall server performance. Yes i can see your commit.

bbijesh · 2019-03-21T14:35:02Z

I also observed similar behavior. In one of the posts above you mentioned that the default thread count is 10. I see in the process dump that around 23 threads are blocked here and in another dump 14 threads. Is it expected that the thread count is increased like this?

robertmujica · 2019-03-21T14:43:38Z

That is the exact issue we are having. We dont see it normal. We are doing more test / analysis today with without this nuget.

…

On Thu 21 Mar 2019, 14:35 Bijesh B, ***@***.***> wrote: I also observed similar behavior. In one of the posts above you mentioned that the *default thread count is 10*. I see in the process dump that around *23* threads are blocked here and in another dump *14* threads. Is it expected that the thread count is increased like this? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#28 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADpFt-0MUqVMkp4tKYblsRkvst2YEYkDks5vY5iXgaJpZM4cALhh> .

bbijesh · 2019-03-21T14:51:30Z

Without this NuGet I don't expect that kind of issue as the specific thread pool itself is expected to not exist. Anyway, please keep this thread updated. Thanks!

mscrivo · 2019-03-21T16:40:37Z

Just to add another data point to this. We just upgraded from SE.Redis 1.2.6 to 2.0.571. All was fine in our tests and on our QA env, but when it hit Prod, our worker threads which use Redis a lot, thus placing it under fairly high load, randomly stop processing anything until the process is restarted. ie. the CPU on the boxes goes down to virtually 0 and they just sit there not doing anything.

There were no other code changes other than the Redis client upgrade. Taking a memory dump when it happens, I see the same 10 threads just waiting in RunWorkLoop. I understand that is normal when there is nothing to be done, but the question is why does work suddenly cease? Is there a deadlock happening somewhere?

The dump shows no other abnormalities, so I'm really scratching my head here. We are seeing a handful of RedisTimeoutExceptions since the upgrade, but I recall @mgravell saying that's likely because the 2.x branch now knows how to report on timeouts in async scenarios, whereas 1.x didn't, so those just may be a red herring.

robertmujica · 2019-03-21T17:22:35Z

one more detail regarding our implementation that we believe maybe causing this issue of leaving many threads in waiting state, is that we are subscribing to 8 Channels ( one for each app ).

We also have two Tasks: one running every 15 seconds on each app to save hash values to Redis Cache and a second thread every 30 seconds to try to sync hash across apps again using Redis Cache.

Do you think this sort of logic maybe the cause of this issue ? we are also thinking about simplifying this logic to use one channel and including application name as part of the message so it makes better use of Thread resources.

Any thoughts ?

mgravell · 2019-03-21T18:01:16Z

The number of channels etc shouldn't be a factor here. If folks are saying there is something odd happening, then I will of course investigate, but: I would really like to focus on the symptoms rather than the investigations. So: I am really really interested in this, but rather than talking about people seeing waiting threads (which does not mean much): I'd really love to hear more about the performance problems and scenarios that you're seeing. Does that make sense?

robertmujica · 2019-03-21T19:32:48Z

Basically

Overtime response time increases. e.g. our website should return within 2 seconds, but it starts taking 5, 9 and more seconds to respond.
And eventually it stops processing more requests and even web process crashes, here we have taken manual dump snapshot of the server and have found those high number of threads in waiting state.

Following is an extract of what we seeing in those dump files when issue happens,

notice that 10.64 % threads blocked when clicking on any of them the .Net call stack is like this

Notice this was taken when we had in PROD version 1.0.9

mscrivo · 2019-03-21T20:03:53Z

Essentially same symptoms here, things just stop processing after a while, but mem dumps don't indicate why, other than the waiting threads as @robertmujica points out. We have some sync over async code in the redis path, so we're working on resolving those. Will report back tomorrow if it made any difference. Open to suggestions on how to dig deeper here.

For reference, our code is running on .net core 2.2 on Windows 2019 VM's against 3 node Redis cluster on version 5.0.3.

bbijesh · 2019-03-22T02:47:57Z

31% of threads blocked is quite high. We are running on Azure WebApp, .net framework 472 .

mgravell · 2019-03-22T02:58:33Z

Agreed. I would expect no more than 10, and 10 would *not* indicate a problem. I will need to investigate, which we will do.

…

On Thu, 21 Mar 2019, 19:47 Bijesh B, ***@***.***> wrote: [image: image] <https://user-images.githubusercontent.com/19475646/54797611-da145f00-4c7a-11e9-9091-11213b750624.png> 31% of threads blocked is quite high. We are running on Azure WebApp, .net framework 472 . — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#28 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABDsEgAl1yBnmcnv9iDM4AuwOKUCHFHks5vZERegaJpZM4cALhh> .

bbijesh · 2019-03-22T04:05:08Z

We took more dumps and see that 20 threads are waiting at
Pipelines_Sockets_Unofficial!Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()+84
Looks like 2 threadpools are getting created then.

robertmujica · 2019-03-24T22:00:30Z

analyzing some of the recent logs ( from 22nd of March ) I can see following entries:

Type: System.InvalidOperationException
Message: Reading is not allowed after reader was completed.
Stack:
[HelperMethodFrame]
System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_NoReadingAllowed()
System.IO.Pipelines.Pipe.AdvanceReader(System.SequencePosition ByRef, System.SequencePosition ByRef)
System.IO.Pipelines.Pipe+DefaultPipeReader.AdvanceTo(System.SequencePosition, System.SequencePosition)
StackExchange.Redis.PhysicalConnection+d__108.MoveNext()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.Execute(System.Action`1<System.Object>, System.Object)
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler+<>c.<.cctor>b__41_0(System.Object)
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
System.Threading.ThreadHelper.ThreadStart(System.Object)
[GCFrame]
[DebuggerU2MCatchHandlerFrame]
[ContextTransitionFrame]
[DebuggerU2MCatchHandlerFrame]

No i wondering if this is the side effect of having network connectivity issues ?

mscrivo · 2019-03-25T13:08:30Z

Reporting back as I said I would. Removing any remaining mixture of sync/async code in this path didn't seem to fix the issue we're having. I'm considering moving back to 1.2.6 just to verify the symptoms we're seeing are in fact caused by the new version.

Looking closer at our logs, to try and see if I can establish any patterns, what I'm seeing is that right before a server goes dead (ie. stops doing any work), I see a handful of RedisTimeoutExceptions like these:

Exception="StackExchange.Redis.RedisTimeoutException" ExceptionDetail="StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 5235ms elapsed, timeout is 5000ms), inst: 0, qu: 0, qs: 1, in: 0, serverEndpoint: , mgr: 7 of 10 available, clientName: ASWorker9, IOCP: (Busy=5,Free=995,Min=1,Max=1000), WORKER: (Busy=10,Free=32757,Min=1,Max=32767), v: 2.0.571.20511 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

bbijesh · 2019-03-25T15:03:13Z

We have switched back to 1.2.6 and the behavior is much better.

mscrivo · 2019-04-04T14:56:06Z

@robertmujica curious if you tried upgrading to 2.0.588? We went back to 1.2.6 for a while after trying 571 (and it totally made the problem go away), then tried 588 and immediately upon deployment last night the problem of dead worker processes came back. Dumps are showing that 43% of threads are blocked, with one in particular that uses SE.Redis that is in a SpinWait.SpinOnce which I don't recall seeing before.

mgravell · 2019-04-04T15:32:56Z

@mscrivo if you're seeing a significant amount of SpinOnce - presumably because you're using a lot of "sync" paths in very high congestion: that changes in 593, which I've pushed to nuget (note to self: go do a release note).

the presence of a small number of threads waiting by themselves does not indicate a problem; those aren't pool threads - they are reserve threads intended to isolate the IO code from the peaks and troughs of the regular thread-pool.

mgravell · 2019-04-04T15:35:10Z

(you can also get the SpinOnce fix simply by taking pipelines.sockets.unofficial 2.0.20; all SE.Redis does here is take the later pipelines build)

mscrivo · 2019-04-04T15:36:47Z

The SpinOnce is coming from .net's BlockingCollection which uses a SemaphoreSlim ... we use BlockingCollections quite a bit in our workflow. Could that be a problem?

Happy to share a mem dump or the debug diag report with you, if you'd like to take a look? I'd admittedly not great with understanding .net's locking mechanisms.

mscrivo · 2019-04-04T15:39:12Z

ok will try that new build, thanks!

mgravell · 2019-04-04T15:40:00Z

Neither SE.Redis nor Pipelines.Sockets.Unofficial use BlockingCollection - however, Pipelines.Sockets.Unofficial does use SpinOnce() directly in the MutexSlim code; so - if the stalls you're seeing definitely have BlockingCollection in them, then: I think that's coming from your code.

As for "Could that be a problem?" - all I can say is "hell yeah" - SemaphoreSlim has a landmine embedded in it: https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html

robertmujica · 2019-04-04T15:40:21Z

@mscrivo no we kept 2.0.571, after further analysis i found in dump file Stackoverflow and OutofMemoryException that we believe is actual root cause of the issue we have. After i fixed the offending lines and Performance testing was done everything seems to be working ok, so next week we are going live with this change ( that is the real test ), will update this threads once i have some news from PROD.

robertmujica · 2019-04-04T15:42:07Z

One more thing we did try was to minimize the number of Channels we use for Pub / sub, but after some analysis we found no differences between listening for 8 channels or two, the number of Threads were pretty much the same, so we discarded that SPIKE.

mscrivo · 2019-04-04T15:43:31Z

thanks @mgravell. I did read your blog post when you posted it. It was very informative. I guess we have to look for an alternative to using BlockingCollections given we can't control what locking mechanism it uses.

mgravell · 2019-04-04T15:45:09Z

Ah, right - yeah there's an important update here that I had shared on the SE.Redis repos, but I hadn't looked for here:

there was a critical race condition when in MutexSlim - occasionally it screwed up and failed to notify the next competitor, which could result in perceived locks and timeouts. This is now addressed, and was the big change in SE.Redis 588; there is also a more recent 593 which makes the spin-lock a little less aggressive when contested by more than one competitor

mgravell · 2019-04-04T15:53:17Z

@mscrivo well, I wouldn't get too excited there without good reason, but... if you're sure you're seeing blocks from BlockingCollection then yeah. I don't know your use-case, but in most cases I've found that less is more - i.e. simple primitives like lock can be very effective in the majority of cases (where you aren't doing exotic things).

robertmujica · 2019-04-04T15:53:23Z

@mgravell so would you recommend to update to 593 or 588 version ?

mgravell · 2019-04-04T15:53:54Z

593, or I wouldn't have pushed it to nuget :)

mscrivo · 2019-04-11T13:23:45Z

Just wanted to report back that after we've eliminated all of our sync over async paths, our deadlocks disappeared with SE.Redis 2.

gokhanabatay · 2019-08-06T10:59:47Z

Hi mscrivo,
I think I have the same issue, I also checked with pre-release
Pipelines.Sockets.Unofficial 2.0.31.51448

StackTrace:

klmallory · 2019-08-23T16:59:28Z

I'm also still having this issue (601) on Batched or Pipelined requests in burst traffic situations.
We use a 9 instance 3 VM cross replicated cluster (no proxy). The packages are on average 10-20 KB. client has low CPU/ memory usage and only shows moderate bandwidth usage on the client during "sync" operations. I know that the package size is more than recommended, and some latency is expected, it's just that Sync and Async behave so differently and I would like to start batching requests and can't because of this issue.

This one line of code is continuously in a wait state.

Intrinsic Latency during test:

Max latency so far: 6 microseconds.
Max latency so far: 26 microseconds.
Max latency so far: 31 microseconds.
Max latency so far: 36 microseconds.
Max latency so far: 101 microseconds.
Max latency so far: 137 microseconds.
Max latency so far: 168 microseconds.
Max latency so far: 230 microseconds.
Max latency so far: 772 microseconds.

It seems like the multiplexer is waiting for other unrelated batches unnecessarily.

I have a simple console test to reproduce. The Parallel.For loops represent a burst traffic situation on the website where we are seeing this and replicates the situation fairly well.
https://gist.github.com/klmallory/da06bd68b5ebef92de72cd4597880c1a

Sync results are consistent:

Sync in 164ms on Thread 16
Sync in 173ms on Thread 20
Sync in 169ms on Thread 25
Sync in 175ms on Thread 22
Sync in 172ms on Thread 17
Sync in 176ms on Thread 1
Sync in 178ms on Thread 24

Batching / Async are all over the place:

Async Pipeline in 753ms on Thread 25
Async Pipeline in 754ms on Thread 46
Async Pipeline in 753ms on Thread 22
Async Pipeline in 754ms on Thread 20
Async Pipeline in 16ms on Thread 37
Async Pipeline in 16ms on Thread 23
Async Pipeline in 15ms on Thread 24
Async Pipeline in 20985ms on Thread 22
Async Pipeline in 20001ms on Thread 46
Async Pipeline in 20983ms on Thread 29
Async Pipeline in 8ms on Thread 22
Async Pipeline in 9ms on Thread 28
Async Pipeline in 10ms on Thread 35
Async Pipeline in 12ms on Thread 41
Async Pipeline in 12ms on Thread 33
Async Pipeline in 15ms on Thread 29
Async Pipeline in 14ms on Thread 25
Async Pipeline in 15ms on Thread 34

tiltsoftware · 2020-01-21T00:41:00Z

Hi! thanks for all the work on this and SE.Redis.

Question related to this. I am seeing the same issues with the RunWorkerLoop showing up as a very high percentage of threads blocked.

My symptoms are similar, CPU spikes and very slow loading asp.net/mvc pages. Sometimes more than a minute to load doing nothing and then suddenly the page loads. Site does not get much traffic and the traffic does not spike when this occurs.

But I am still on Redis 2.0.519 which references version 1.0.9 of this framework.

Could the issue that was mentioned in the blog https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html be causing my issues? considering I have the version that included those bugs?

I am currently testing SE.Redis 2.0.601 which is the current latest version in Nuget. I'm hoping that will resolve the issue I am having.

But possibly I'm doing something wrong in my implementation that's causing it and the blocks are just showing up in my profiler coincidentally or relatedly but not the root problem.

I'm also using Micrsoft.SignalR.StackExchange.Redis and Microsoft.Web.RedisSessionStateProvider that also use SE.Redis

Thanks

mgravell · 2020-01-21T07:14:52Z

"Could the issue that was mentioned in the blog https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html be causing my issues? considering I have the version that included those bugs?" quite possibly, but impossible for us to give a hard "yes"/"no" from here

…

On Tue, 21 Jan 2020 at 00:41, tiltsoftware ***@***.***> wrote: Hi! thanks for all the work on this and SE.Redis. Question related to this. I am seeing the same issues with the RunWorkerLoop showing up as a very high percentage of threads blocked. But I am still on Redis 2.0.519 which references version 1.0.9 of this framework. Could the issue that was mentioned in the blog https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html be causing my issues? considering I have the version that included those bugs? I am currently testing SE.Redis 2.0.601 which is the current latest version in Nuget. I'm hoping that will resolve the issue I am having. But possibly I'm doing something wrong in my implementation that's causing it and the blocks are just showing up in my profiler coincidentally or relatedly but not the root problem. I'm also using Micrsoft.SignalR.StackExchange.Redis and Microsoft.Web.RedisSessionStateProvider that also use SE.Redis Thanks — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#28?email_source=notifications&email_token=AAAEHMGVUEOIEUXOH7GGASLQ6ZABZA5CNFSM4HAAXBQ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJOEF5Q#issuecomment-576471798>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAEHMCVWU5M3ACUEQHXQNTQ6ZABZANCNFSM4HAAXBQQ> .

-- Regards, Marc

tiltsoftware · 2020-01-28T03:26:01Z

"Could the issue that was mentioned in the blog https://blog.marcgravell.com/2019/02/fun-with-spiral-of-death.html be causing my issues? considering I have the version that included those bugs?" quite possibly, but impossible for us to give a hard "yes"/"no" from here
…
-- Regards, Marc

Thanks for the reply Marc, I went ahead and updated a bunch of dependencies so I could bring us up to the latest version of this so that will eliminate whether that was our issue or not. I suspect it's somewhere in our implementation though but was worth doing to get everything up to date anyway.

ikalafat · 2020-07-23T08:45:50Z

Hi,

I'm experiencing (probably) same issue with Pipelines.Sockets.Unofficial 2.1.16 and SE.Redis 2.1.55, running on .NET 471

Here is screenshot from profiler, if it is of any help.

Is there anything that I can do to check/verify/help/fix this?

Thanks!

svenclaesson · 2020-09-02T14:35:04Z

Hi,

This might be of interest to someone.
We are currently experiencing these issues after upgrading to .NET core 3.1. We had no issues when running on .NET core 2.2.

https://i.imgur.com/cYQ92Wp.png

ahmedtolba1984 · 2020-10-22T01:13:57Z

@svenclaesson
We already have the same issue when upgrading to .NET Core 3.1
Did you solve this issue? and If Yes, How did you solve it?

Imjamper · 2020-12-22T10:40:02Z

Hi, same question, same issue.

@svenclaesson
We already have the same issue when upgrading to .NET Core 3.1
Did you solve this issue? and If Yes, How did you solve it?

svenclaesson · 2020-12-22T13:47:08Z

We never fully figured this out why but we got rid of all sync over async code and the CPU usage dropped significantly. Still not at the level with .NET Core 2.2 but we left it at that for now.

Hi, same question, same issue.

@svenclaesson
We already have the same issue when upgrading to .NET Core 3.1
Did you solve this issue? and If Yes, How did you solve it?

ikalafat · 2021-01-19T17:15:14Z

Hi @mgravell

is there anything we can do to circumvent this issue except removing all sync-over-async code? I've had a fun evening with the performance profiler (due to an unrelated issue) and noticed that 25% of CPU time from profiling session is accounted to Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop

Not all redis-related code is under my control, so I'm not sure whether I'll be able to remove all sync-over-async code.

Any other suggestions? Pipelines.Sockets.Unofficial 2.1.16 and SE.Redis 2.1.55, not sure were there any changes in SE.Redis lib in meantime.

Thanks.

korchak-aleksandr · 2021-09-03T15:51:56Z

Hello @mgravell

We have the same issue on .NET Core 3.1 using the latest version of StackExchange.Redis nuget.
JetBrains DotTrace profiling shows that very high percentage / number of Threads in waiting status in Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop

You said earlier that you would investigate this problem.
Could you tell current status of investigation?

We are looking forward to your reply and possible fix.
Thanks.

ThiagoT1 · 2021-11-25T01:48:19Z

We also have traces pointing to high CPU usage on this flow. We are running MS Orleans on K8s with the RedisGrainDirectory, and immediately after the deploy, even with no load, the issue arrises.

We are referencing version 2.2.88

Edit: we removed SE.Redis of the equation and the CPU kept high. Then we found a race in our code. We're fixing.

Is very weird that it didn't show in the profilings, though.

rkenshin · 2022-05-10T02:02:57Z

We also have traces pointing to high CPU usage on this flow. We are running MS Orleans on K8s with the RedisGrainDirectory, and immediately after the deploy, even with no load, the issue arrises.

We are referencing version 2.2.88

Edit: we removed SE.Redis of the equation and the CPU kept high. Then we found a race in our code. We're fixing.

Is very weird that it didn't show in the profilings, though.

Hey ThiagoT1, how did you find the race in your code, i am also using Orleans and running into similar issues, i can't find anything that suggest that our code is at fault so far. Thank you

roklenardic · 2022-09-07T07:59:38Z

Experiencing similar situation, app takes 6 minutes to start and when profiling with dotTrace this shows up:

Using nuget StackExchange.Redis 2.6.66 and the same was with 2.2.88. Using .NET framework 4.8

alex-jitbit · 2022-11-15T23:38:46Z

Seeing the same issue on .NET 6 with SE-.Redis latest version (2.6.70)

Has anyone came up with any solution?

NickCraver · 2022-11-16T00:01:49Z

For anyone seeing the above: this isn't a problem, it's a matter of understanding what the profiler is telling you. There are dedicated threads for the scheduler that aren't part of the thread pool and are not blocking other work. If they're in RunWorkLoop they also aren't using CPU - they are awaiting a monitor event for work being available.

These profilers are telling you "we took a lot of samples, here's what we saw in those samples and how many times". Seeing these threads idle and awaiting work is good, though misleading looking at a profiler like this which is based on samples and/or wall time. It's not using active CPU, etc. and can be ignored, it's just an artifact of how threads/waiting/profilers work.

ikalafat · 2022-11-22T10:24:33Z

@NickCraver thank you for the additional explanation. The "problem" is that people (including myself) are seeing this kind of results in the profiler, where large amount of time is "taken" by RunWorkLoop samples, and are unknowingly being misled by those results.

I still haven't found a way to "exclude" specific namespaces from profiler results (for example in DotTrace), so that actual timings, and percentages, are "recalculated" with exclusions in mind, where we would have more precise results, which would allow better profiling of app itself.

pengweiqhca · 2024-01-10T08:26:16Z

Use SocketManager.ThreadPool rather than SocketManager.Shared.

var options = ConfigurationOptions.Parse(configuration);

options.SocketManager = SocketManager.ThreadPool;

robertmujica changed the title ~~This thread is waiting in System.Threading.Monitor.Wait~~ High percentage of threads blocked on DedicatedThreadPoolPipeScheduler.RunWorkLoop Mar 20, 2019

mgravell added a commit that referenced this issue Mar 20, 2019

expose the number of pending work items in a queue - related to #28

e2d483c

ghost mentioned this issue Jan 25, 2022

Application Stops Responding ChilliCream/graphql-platform#4688

Closed

1 task

High percentage of threads blocked on DedicatedThreadPoolPipeScheduler.RunWorkLoop #28

High percentage of threads blocked on DedicatedThreadPoolPipeScheduler.RunWorkLoop #28

Comments

robertmujica commented Mar 20, 2019 • edited Loading

[[GCFrame]]

ntdll!NtWaitForMultipleObjects+14

mgravell commented Mar 20, 2019 • edited Loading

mgravell commented Mar 20, 2019 • edited Loading

robertmujica commented Mar 20, 2019

mgravell commented Mar 20, 2019

robertmujica commented Mar 20, 2019

bbijesh commented Mar 21, 2019

robertmujica commented Mar 21, 2019 via email

bbijesh commented Mar 21, 2019 • edited Loading

mscrivo commented Mar 21, 2019

robertmujica commented Mar 21, 2019

mgravell commented Mar 21, 2019

robertmujica commented Mar 21, 2019

mscrivo commented Mar 21, 2019

bbijesh commented Mar 22, 2019

mgravell commented Mar 22, 2019 via email

bbijesh commented Mar 22, 2019 • edited Loading

robertmujica commented Mar 24, 2019

mscrivo commented Mar 25, 2019 • edited Loading

bbijesh commented Mar 25, 2019

mscrivo commented Apr 4, 2019 • edited Loading

mgravell commented Apr 4, 2019

mgravell commented Apr 4, 2019

mscrivo commented Apr 4, 2019

mscrivo commented Apr 4, 2019

mgravell commented Apr 4, 2019

robertmujica commented Apr 4, 2019

robertmujica commented Apr 4, 2019

mscrivo commented Apr 4, 2019

mgravell commented Apr 4, 2019

mgravell commented Apr 4, 2019

robertmujica commented Apr 4, 2019

mgravell commented Apr 4, 2019

mscrivo commented Apr 11, 2019

gokhanabatay commented Aug 6, 2019 • edited Loading

klmallory commented Aug 23, 2019 • edited Loading

tiltsoftware commented Jan 21, 2020 • edited Loading

mgravell commented Jan 21, 2020 via email

tiltsoftware commented Jan 28, 2020

ikalafat commented Jul 23, 2020 • edited Loading

svenclaesson commented Sep 2, 2020

ahmedtolba1984 commented Oct 22, 2020

Imjamper commented Dec 22, 2020

svenclaesson commented Dec 22, 2020

ikalafat commented Jan 19, 2021

korchak-aleksandr commented Sep 3, 2021

ThiagoT1 commented Nov 25, 2021 • edited Loading

rkenshin commented May 10, 2022 • edited Loading

roklenardic commented Sep 7, 2022

alex-jitbit commented Nov 15, 2022 • edited Loading

NickCraver commented Nov 16, 2022

ikalafat commented Nov 22, 2022

pengweiqhca commented Jan 10, 2024

robertmujica commented Mar 20, 2019 •

edited

Loading

mgravell commented Mar 20, 2019 •

edited

Loading

mgravell commented Mar 20, 2019 •

edited

Loading

bbijesh commented Mar 21, 2019 •

edited

Loading

bbijesh commented Mar 22, 2019 •

edited

Loading

mscrivo commented Mar 25, 2019 •

edited

Loading

mscrivo commented Apr 4, 2019 •

edited

Loading

gokhanabatay commented Aug 6, 2019 •

edited

Loading

klmallory commented Aug 23, 2019 •

edited

Loading

tiltsoftware commented Jan 21, 2020 •

edited

Loading

ikalafat commented Jul 23, 2020 •

edited

Loading

ThiagoT1 commented Nov 25, 2021 •

edited

Loading

rkenshin commented May 10, 2022 •

edited

Loading

alex-jitbit commented Nov 15, 2022 •

edited

Loading