
SqlDistributedLockHandle.HandleLostToken blocks current thread indefinitely-ish #85

Closed
joshua-mng opened this issue Apr 20, 2021 · 4 comments

joshua-mng commented Apr 20, 2021

I am using your library to implement a leader election algorithm, and it's working very well. Thanks for the wonderful work.

Here is the relevant section of my code:

protected override async Task OnElectAsync(CancellationToken cancellationToken)
{
    var dlock = _distributedLockManager.CreateLock($"{typeof(DistributedLockLeaderElectionProvider).FullName}/leader/8E7B4AC6-6C08-4467-801A-AA0AA76533AC/leader.lock");

    while (!cancellationToken.IsCancellationRequested) {
        var handle = await dlock.TryAcquireAsync(cancellationToken: cancellationToken);
        if (handle != null) {
            // save for later disposal when the current node stops or we lose the lock
            _lockHandle = handle;

            // register for handle loss
            if (handle.HandleLostToken.CanBeCanceled) {
                handle.HandleLostToken.Register(() => {
                    _lockHandle = null;

                    // lock connection lost, lease expired etc
                    // that means we are no longer leader
                    NotifyLeaderChanged(null);

                    StartElection("Lock handle lost");

                    // get rid of handle
                    try {
                        handle.Dispose();
                    } catch { }
                });
            }

            // lock acquired
            Logger.LogInformation($"Successfully aqcuired distributed leader lock, Instance Id={Constants.InstanceId.ToString("N")}, Instance Name={CommonHelper.GetAppFlag<string>(Constants.AppFlags.InstanceName)}");

            var others = Participants.GetOtherPeers(Cluster.CurrentPeer);
            Logger.LogDebug($"I am sending LEADER CHANGED message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
            try {
                await Cluster.Channel.SendMessageAsync(new LeaderChangedMessage {
                    LeaderAddress = Cluster.CurrentPeer.Address.ToParsableString()
                }, others, cancellationToken: cancellationToken);
            } catch (OperationCanceledException) {
            } catch (Exception ex) {
                Logger.LogError(ex, $"I could not send LEADER CHANGED message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
            }

            NotifyLeaderChanged(Cluster.CurrentPeer);

            // no need to loop, we acquired lock, we will keep it until shutdown
            break;
        } else {
            // couldn't acquire lock, so let's wait for a while and then try again
            if (CurrentLeader == null) {
                // no leader, so let's find out who is the current leader
                var others = Participants.GetOtherPeers(Cluster.CurrentPeer);
                Logger.LogDebug($"I am sending ASK LEADER message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
                try {
                    await Cluster.Channel.SendMessageAsync(new AskLeaderRequest {
                        SenderAddress = Cluster.CurrentPeer.Address.ToParsableString()
                    }, others, cancellationToken: cancellationToken);
                } catch (OperationCanceledException) {
                } catch (Exception ex) {
                    Logger.LogError(ex, $"I could not send ASK LEADER message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
                }
            }

            try {
                await Task.Delay(TimeSpan.FromMinutes(5), cancellationToken);
            } catch (Exception) {
                break;
            }
        }
    }
}

When this method runs, it blocks on handle.HandleLostToken for a very long time (I guess indefinitely).

But when I access that property on another thread like this, it works like a charm.

using (CommonHelper.SuppressAllScopes()) {
    Task.Run(() => {
        // register for handle loss
        if (handle.HandleLostToken.CanBeCanceled) {
            handle.HandleLostToken.Register(() => {
                _lockHandle = null;

                // lock connection lost, lease expired etc
                // that means we are no longer leader
                NotifyLeaderChanged(null);

                StartElection("Lock handle lost");

                // get rid of handle
                try {
                    handle.Dispose();
                } catch { }
            });
        }
    }).Ignore();
}

Also, when the app shuts down and I call _lockHandle.DisposeAsync in another method, it either blocks or succeeds depending on whether I accessed the _lockHandle.HandleLostToken property. In the first section of code (where I didn't access _lockHandle.HandleLostToken), _lockHandle.DisposeAsync completes without blocking, but in the second section of code, where I am accessing _lockHandle.HandleLostToken, it blocks.
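
For illustration, the shutdown path looks roughly like this (OnStopAsync and the surrounding structure are approximate, not the exact code):

protected override async Task OnStopAsync(CancellationToken cancellationToken)
{
    // Grab and clear the handle saved in OnElectAsync (hypothetical method/field names).
    var handle = _lockHandle;
    _lockHandle = null;

    if (handle != null)
    {
        // Completes quickly when HandleLostToken was never read,
        // but blocks when HandleLostToken was accessed earlier.
        await handle.DisposeAsync();
    }
}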

I think there is some kind of synchronization-context deadlock or something similar here. Thanks!

madelson (Owner) commented

Thanks for reporting @joshua-mng. Can you show the code used to create _distributedLockManager? What underlying provider is this (SQL, Redis, etc) and what options do you use?

joshua-mng (Author) commented

It is the SQL Server provider. I am just passing my connection string to build the provider; no extra options are provided.

new SqlDistributedSynchronizationProvider("Data Source=(localdb)\\MSSQLLocalDB;Initial Catalog=DBName;Integrated Security=True;MultipleActiveResultSets=True;")

@madelson madelson added the bug label Apr 21, 2021
madelson (Owner) commented

Ok great, I've been able to reproduce this:

[Test]
public async Task TestHandleLostTokenWhileLocked()
{
    var lockMan = new SqlDistributedSynchronizationProvider(TestingSqlServerDb.ConnectionString);
    var @lock = lockMan.CreateLock(Guid.NewGuid().ToString());
    var handle = await @lock.TryAcquireAsync();
    if (handle != null)
    {
        var t = Task.Run(() =>
        {
            if (handle.HandleLostToken.CanBeCanceled)
            {
                handle.HandleLostToken.Register(() =>
                {
                    Console.WriteLine("Triggered");
                    handle.Dispose();
                });
            }
        });
        Assert.IsTrue(t.Wait(TimeSpan.FromSeconds(5))); // fails
    }
}

@madelson madelson added this to the 2.0.2 milestone Apr 24, 2021
madelson (Owner) commented

I've tracked down the root cause for this and should have a fix soon.

The sequence of events is:

  • With keepalive enabled (on by default for SQL Server, but accidentally disabled for the relevant tests), the background keepalive/monitor loop is sleeping, waiting to do its next keepalive ping.
  • When the HandleLostToken is accessed, it cancels the keepalive loop's sleep. However, the cancellation callback runs on the Cancel() thread, so the loop resumes there. The loop then tries to do monitoring, which runs a WAITFOR query until the next cancel. Normally that would be an async operation, but because we are still mid-cancellation the cancellation token already looks canceled, so the operation exits immediately and the loop runs again, all without ever leaving the Cancel() thread. We end up looping indefinitely: nothing can change until the Cancel() finishes, and the Cancel() can't finish because the background loop has stolen its thread (see the sketch below).
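
To illustrate the general hazard outside the library (a standalone sketch, not DistributedLock's actual keepalive code): a callback registered on a CancellationToken runs synchronously on the thread that calls Cancel(), so if that callback spins in a loop that can only exit once cancellation has "finished", Cancel() never returns.

using System;
using System.Threading;

class CancelThreadStarvationSketch
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        // Stand-in for the keepalive/monitor loop waking up inside the Cancel() call.
        cts.Token.Register(() =>
        {
            // The token already looks canceled while Cancel() is still running, so an
            // operation that checks it returns immediately and this loop spins forever,
            // never yielding the thread back so that Cancel() can complete.
            while (cts.Token.IsCancellationRequested)
            {
                // simulate "start monitoring, observe cancellation, loop again" with no await
            }
        });

        Console.WriteLine("Calling Cancel()...");
        cts.Cancel(); // never returns: the registered callback has taken over this thread
        Console.WriteLine("unreachable");
    }
}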

madelson added a commit that referenced this issue Apr 24, 2021
Make all SqlServer tests use the default keepalive behavior by default.