New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock caused by lock in static constructor #1193

Closed
reima opened this Issue Aug 17, 2018 · 14 comments

Comments

Projects
None yet
2 participants
@reima

reima commented Aug 17, 2018

Summary

I've found a rather subtle deadlock condition in Noda Time that can be triggered by creating patterns from different threads.

Reproduction

Here's a minimal program that reproduces the issue:

using NodaTime.Text;
using System.Diagnostics;
using System.Globalization;
using System.Threading;

namespace NodaTimeBug
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            var enUsThread = new Thread(
                () =>
                {
                    Debugger.Break();
                    LocalDateTimePattern.Create("ddMMyyyy", new CultureInfo("en-US", false));
                }) { Name = "en-US" };

            var invariantThread = new Thread(
                () =>
                {
                    LocalDateTimePattern.Create("ddMMyyyy", CultureInfo.InvariantCulture);
                }) { Name = "Invariant" };

            enUsThread.Start();
            invariantThread.Start();

            enUsThread.Join();
            invariantThread.Join();
        }
    }
}

The program may not deadlock every time, as there is a race condition involved. To reproduce the deadlock consistently, follow these steps in Visual Studio:

  1. Create a breakpoint at
    if (dictionary.TryGetValue(key, out TValue value))
  2. Start debugging.
  3. The Debugger.Break() should get hit first. While the program is paused, freeze the thread named "en-US" and continue.
  4. The breakpoint in Cache.cs should get hit next. While the program is paused, freeze the thread named "Invariant", and thaw the thread named "en-US". Then continue.
  5. The breakpoint in Cache.cs should get hit again. Continue.
  6. Wait for a second or so and pause the program. Thaw the thread "Invariant". All threads should now be running again. Continue.
  7. Observe that the program is now deadlocked.

Cause

Suppose the thread "Invariant" starts running first. It eventually calls into NodaFormatInfo.InvariantInfo.ParsePattern, which calls Cache.GetOrAdd on its cache field. Cache.GetOrAdd locks on its mutex. Suppose the thread is now preempted. Note that NodaFormatInfo.InvariantInfo.cache.mutex is locked by the thread "Invariant".

Thread "en-US" now runs. It also first goes through a Cache.GetOrAdd, which locks a mutex. But that's not relevant here, as this is a different mutex (from the NodaFormatInfo for the "en-US" culture). The important part is that LocalDateTimePatternParser.ParsePattern is called through the valueFactory of the Cache object. But the method cannot run immediately, because it references static fields from LocalDateTimePattern.Patterns that have not yet been initialized. That means that the CLR must call the static constructor of LocalDateTimePattern.Patterns first.

To ensure the static constructor is called only once, the CLR acquires an unique lock for the static constructor of LocalDateTimePattern.Patterns. The static initialization calls into LocalDateTimePattern.CreateWithInvariantCulture, which calls into NodaFormatInfo.InvariantInfo.ParsePattern, which calls Cache.GetOrAdd on its cache field. This method tries to lock its mutex. This mutex is currently locked by the thread "Invariant", which means the thread "en-US" blocks. Note that the unique lock for the static constructor of LocalDateTimePattern.Patterns is still held by the thread "en-US".

Suppose the thread "Invariant" is now scheduled again. It is still in the Cache.GetOrAdd method. Similar to the other thread, LocalDateTimePatternParser.ParsePattern is called through the valueFactory of the Cache. For the same reasons as before, the CLR tries to acquire the unique lock for the static constructor of LocalDateTimePattern.Patterns. But this lock is held by the thread "en-US". Hence the deadlock.

Here are the stack traces for both threads at the time of the deadlock:

Invariant

[Managed to Native Transition]
NodaTime.dll!NodaTime.Text.FixedFormatInfoPatternParser<NodaTime.LocalDateTime>..ctor.AnonymousMethod__0(string patternText) Line 24
NodaTime.dll!NodaTime.Utility.Cache<string, NodaTime.Text.IPattern<NodaTime.LocalDateTime>>.GetOrAdd(string key) Line 60
NodaTime.dll!NodaTime.Text.FixedFormatInfoPatternParser<NodaTime.LocalDateTime>.ParsePattern(string pattern) Line 28
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, NodaTime.Globalization.NodaFormatInfo formatInfo, NodaTime.LocalDateTime templateValue) Line 169
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, System.Globalization.CultureInfo cultureInfo, NodaTime.LocalDateTime templateValue) Line 191
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, System.Globalization.CultureInfo cultureInfo) Line 204
NodaTimeBug.exe!NodaTimeBug.Program.Main.AnonymousMethod__0_1() Line 23
mscorlib.dll!System.Threading.ThreadHelper.ThreadStart_Context(object state) Line 41
mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) Line 293
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) Line 268
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 261
mscorlib.dll!System.Threading.ThreadHelper.ThreadStart() Line 60

en-US

mscorlib.dll!System.Threading.Monitor.Enter(object obj, ref bool lockTaken) Line 30
NodaTime.dll!NodaTime.Utility.Cache<string, NodaTime.Text.IPattern<NodaTime.LocalDateTime>>.GetOrAdd(string key) Line 43
NodaTime.dll!NodaTime.Text.FixedFormatInfoPatternParser<NodaTime.LocalDateTime>.ParsePattern(string pattern) Line 28
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, NodaTime.Globalization.NodaFormatInfo formatInfo, NodaTime.LocalDateTime templateValue) Line 169
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.CreateWithInvariantCulture(string patternText) Line 230
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Patterns.Patterns() Line 87
[Native to Managed Transition]
[Managed to Native Transition]
NodaTime.dll!NodaTime.Text.FixedFormatInfoPatternParser<NodaTime.LocalDateTime>..ctor.AnonymousMethod__0(string patternText) Line 24
NodaTime.dll!NodaTime.Utility.Cache<string, NodaTime.Text.IPattern<NodaTime.LocalDateTime>>.GetOrAdd(string key) Line 60
NodaTime.dll!NodaTime.Text.FixedFormatInfoPatternParser<NodaTime.LocalDateTime>.ParsePattern(string pattern) Line 28
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, NodaTime.Globalization.NodaFormatInfo formatInfo, NodaTime.LocalDateTime templateValue) Line 169
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, System.Globalization.CultureInfo cultureInfo, NodaTime.LocalDateTime templateValue) Line 191
NodaTime.dll!NodaTime.Text.LocalDateTimePattern.Create(string patternText, System.Globalization.CultureInfo cultureInfo) Line 204
NodaTimeBug.exe!NodaTimeBug.Program.Main.AnonymousMethod__0_0() Line 16
mscorlib.dll!System.Threading.ThreadHelper.ThreadStart_Context(object state) Line 41
mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) Line 293
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) Line 268
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 261
mscorlib.dll!System.Threading.ThreadHelper.ThreadStart() Line 60

Solution (?)

Generally speaking, blocking (directly or indirectly) inside a static constructor or initializer should probably be avoided as much as possible. As this example shows, it's not always obvious when a static constructor will run. This makes it very difficult to prevent deadlocks like this, as the order the locks are taken in can be hard to control.

So in my opinion the goal should be to eliminate blocking calls from static constructors (or initializers). Unfortunately I'm not versed enough in the intricacies of Noda Time to make that change myself or propose an alternative solution to this particular problem.

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 17, 2018

Humbug. Thanks for the detailed report - will dive into it over the weekend

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 17, 2018

Okay, I've now read through in more detail and I believe I understand. While it would be great not to have anything that needs to lock within the type initializers in Noda Time, there are a lot of "natural constants" that we wouldn't want to give up. (We've had a similar problem before, but with a slightly different cause.)

It feels to me like the biggest problem is that Cache.GetOrAdd calls arbitrary code within the lock. I'll need to think about how much of a problem it is (if at all) to call the value factory more times than necessary; if that's not a problem, then it shouldn't be too hard to fix.

Alternatively, I can look into whether ConcurrentDictionary is supported by all the platforms we now target, as that would be a simple fix as well.

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

Note: to observe the issue, I've had to change the pattern to "o" to ensure that the Patterns nested class is actually initialized. As it doesn't have a static constructor in the source code, the timing is implementation-specific.

With that change, I can reproduce the problem. I'm now going to try to reproduce it outside the debugger by adding some manual sleeps where we froze/thawed the threads.

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

Okay, I've reproduced this without breaking into the debugger until we think it's deadlocked by adding these changes:

  • In LocalDateTimePatternParser, just before it returns LocalDateTimePattern.Patterns.BclRoundtripPatternImpl, sleep for 4 seconds if we're on the invariant thread
  • In the test program, change Debugger.Break to Thread.Sleep(2000);

Importantly, this doesn't require changing the cache... which means I will be able to validate that when the cache implementation is changed, it works...

@reima

This comment has been minimized.

reima commented Aug 18, 2018

Jon, thank you for looking into this so quickly.

Note: to observe the issue, I've had to change the pattern to "o" to ensure that the Patterns nested class is actually initialized. As it doesn't have a static constructor in the source code, the timing is implementation-specific.

That's interesting, I did not know that. I should have mentioned that I tested this with .NET Framework 4.7.1.

ConcurrentDictionary is available almost anywhere (.NET Core >= 1.0, .NET Framework >= 4.0, .NET Standard >= 1.1), so it should be safe to use. But I think we would still need manual synchronization for the cache eviction, right?

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

Cache eviction is a little interesting, but I've got a solution which I think will work okay.
I've got the cache part done - I'm now making a few other bits of code safer. Are you happy for me to send you the PR for review shortly?

Another thing to work out: how/where we're going to release this.

  • Current release version is 2.3.0
  • master branch is for 3.0.0 which I don't expect to be released for a while
  • I'd like to do a new minor version at some point to add netstandard2.0 as a target

I was wondering whether to create a 2.4.0 branch now, cherry pick this work into there, and do a single release, leaving older versions broken. The alternative is to create a 2.3.1 and then do 2.4.0 separately. What do you think?

jskeet added a commit to jskeet/nodatime that referenced this issue Aug 18, 2018

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

@reima: The fix is in #1194 if you have time to look.

@reima

This comment has been minimized.

reima commented Aug 18, 2018

@jskeet: I've added some comments, but overall the solution looks good to me.

Regarding the release I'm fine with either 2.3.1 or 2.4.0, whatever works best for you.

@jskeet jskeet closed this in 422e77b Aug 18, 2018

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

Cool - I'll do a 2.4.0 "soonish" :)

jskeet added a commit to jskeet/nodatime that referenced this issue Aug 18, 2018

jskeet added a commit to jskeet/nodatime that referenced this issue Aug 18, 2018

@reima

This comment has been minimized.

reima commented Aug 18, 2018

Thank you again for the amazingly quick reaction and fix! My colleagues will be pleased to hear that hanging builds will soon be a thing of the past :)

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 18, 2018

It wouldn't have been nearly as quick a fix if you hadn't gone into such detail reproducing it. Incredibly helpful! I'm sorry you've run into this at all - the bug has probably been there since Noda Time 1.0 in 2012...

I'll try to get 2.4.0 released either tomorrow or some time over the next week. (It'll be good to have the netstandard2.0 target in there too - adding NodaTime in NuGet will be a lot less scary in terms of dependencies.)

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 19, 2018

@reima: 2.4.0 is now out - please let me know if you have any problems with it.

@reima

This comment has been minimized.

reima commented Aug 21, 2018

@jskeet: The upgrade to 2.4.0 went smoothly (we only have .NET Framework projects, so I can't say anything about the new .NET Standard target). And we've had no problems with deadlocks anymore so far :)

@jskeet

This comment has been minimized.

Member

jskeet commented Aug 21, 2018

Hooray - thanks for letting me know :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment