
SqlDistributedLockHandle.HandleLostToken blocks current thread indefinitely-ish #85

Closed
joshua-mng opened this issue Apr 20, 2021 · 4 comments

joshua-mng commented Apr 20, 2021

I am using your library to implement a leader election algorithm, and it's working very well. Thanks for the wonderful work.

Here is the relevant section of my code:

protected override async Task OnElectAsync(CancellationToken cancellationToken)
{
    var dlock = _distributedLockManager.CreateLock($"{typeof(DistributedLockLeaderElectionProvider).FullName}/leader/8E7B4AC6-6C08-4467-801A-AA0AA76533AC/leader.lock");

    while (!cancellationToken.IsCancellationRequested) {
        var handle = await dlock.TryAcquireAsync(cancellationToken: cancellationToken);
        if (handle != null) {
            // save for later disposal when the current node stops or we lose the lock
            _lockHandle = handle;

            // register for handle loss
            if (handle.HandleLostToken.CanBeCanceled) {
                handle.HandleLostToken.Register(() => {
                    _lockHandle = null;

                    // lock connection lost, lease expired etc
                    // that means we are no longer leader
                    NotifyLeaderChanged(null);

                    StartElection("Lock handle lost");

                    // get rid of handle
                    try {
                        handle.Dispose();
                    } catch { }
                });
            }

            // lock acquired
            Logger.LogInformation($"Successfully aqcuired distributed leader lock, Instance Id={Constants.InstanceId.ToString("N")}, Instance Name={CommonHelper.GetAppFlag<string>(Constants.AppFlags.InstanceName)}");

            var others = Participants.GetOtherPeers(Cluster.CurrentPeer);
            Logger.LogDebug($"I am sending LEADER CHANGED message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
            try {
                await Cluster.Channel.SendMessageAsync(new LeaderChangedMessage {
                    LeaderAddress = Cluster.CurrentPeer.Address.ToParsableString()
                }, others, cancellationToken: cancellationToken);
            } catch (OperationCanceledException) {
            } catch (Exception ex) {
                Logger.LogError(ex, $"I could not send LEADER CHANGED message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
            }

            NotifyLeaderChanged(Cluster.CurrentPeer);

            // no need to loop, we acquired lock, we will keep it until shutdown
            break;
        } else {
            // couldn't acquire lock, so let's wait for a while and then try again
            if (CurrentLeader == null) {
                // no leader, so let's find out who is the current leader
                var others = Participants.GetOtherPeers(Cluster.CurrentPeer);
                Logger.LogDebug($"I am sending ASK LEADER message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
                try {
                    await Cluster.Channel.SendMessageAsync(new AskLeaderRequest {
                        SenderAddress = Cluster.CurrentPeer.Address.ToParsableString()
                    }, others, cancellationToken: cancellationToken);
                } catch (OperationCanceledException) {
                } catch (Exception ex) {
                    Logger.LogError(ex, $"I could not send ASK LEADER message to Count={others.Count()} List={Utils.EnumerableToString(others)} other peers ME={Cluster.CurrentPeer}");
                }
            }

            try {
                await Task.Delay(TimeSpan.FromMinutes(5), cancellationToken);
            } catch (Exception) {
                break;
            }
        }
    }
}

When this method runs, it blocks on handle.HandleLostToken for a very long time (I guess indefinitely).

But when I access that property on another thread like this, it works like a charm.

using (CommonHelper.SuppressAllScopes()) {
    Task.Run(() => {
        // register for handle loss
        if (handle.HandleLostToken.CanBeCanceled) {
            handle.HandleLostToken.Register(() => {
                _lockHandle = null;

                // lock connection lost, lease expired etc
                // that means we are no longer leader
                NotifyLeaderChanged(null);

                StartElection("Lock handle lost");

                // get rid of handle
                try {
                    handle.Dispose();
                } catch { }
            });
        }
    }).Ignore();
}

Also, when the app shuts down and I call _lockHandle.DisposeAsync in another method, it either blocks or succeeds depending on whether I accessed the _lockHandle.HandleLostToken property. In the first section of code (where I didn't access _lockHandle.HandleLostToken), _lockHandle.DisposeAsync completes without blocking, but in the second section of code, where I am accessing _lockHandle.HandleLostToken, it blocks.
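
For illustration, the shutdown path looks roughly like this (OnStopAsync and the surrounding structure are approximate, not the exact code):

protected override async Task OnStopAsync(CancellationToken cancellationToken)
{
    // Grab and clear the handle saved in OnElectAsync (hypothetical method/field names).
    var handle = _lockHandle;
    _lockHandle = null;

    if (handle != null)
    {
        // Completes quickly when HandleLostToken was never read,
        // but blocks when HandleLostToken was accessed earlier.
        await handle.DisposeAsync();
    }
}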

I think there is some kind of synchronization-context deadlock or something similar here. Thanks!

madelson (Owner) commented

Thanks for reporting @joshua-mng. Can you show the code used to create _distributedLockManager? What underlying provider is this (SQL, Redis, etc) and what options do you use?

joshua-mng (Author) commented

It is the SQL Server provider. I am just passing my connection string to build the provider; no extra options are provided.

new SqlDistributedSynchronizationProvider("Data Source=(localdb)\\MSSQLLocalDB;Initial Catalog=DBName;Integrated Security=True;MultipleActiveResultSets=True;")

@madelson madelson added the bug label Apr 21, 2021
madelson (Owner) commented

Ok great, I've been able to reproduce this:

[Test]
public async Task TestHandleLostTokenWhileLocked()
{
    var lockMan = new SqlDistributedSynchronizationProvider(TestingSqlServerDb.ConnectionString);
    var @lock = lockMan.CreateLock(Guid.NewGuid().ToString());
    var handle = await @lock.TryAcquireAsync();
    if (handle != null)
    {
        var t = Task.Run(() =>
        {
            if (handle.HandleLostToken.CanBeCanceled)
            {
                handle.HandleLostToken.Register(() =>
                {
                    Console.WriteLine("Triggered");
                    handle.Dispose();
                });
            }
        });
        Assert.IsTrue(t.Wait(TimeSpan.FromSeconds(5))); // fails
    }
}

@madelson madelson added this to the 2.0.2 milestone Apr 24, 2021
madelson (Owner) commented

I've tracked down the root cause for this and should have a fix soon.

The sequence of events is:

  • With keepalive enabled (on by default for SQL Server, but accidentally disabled for the relevant tests), the background keepalive/monitor loop is sleeping, waiting to do its next keepalive ping.
  • When the HandleLostToken is accessed, it cancels the keepalive loop's sleep. However, the cancellation callback runs on the Cancel() thread, so the loop resumes there. The loop then tries to do monitoring, which runs a WAITFOR query until the next cancel. Normally that would be an async operation, but because we are still mid-cancellation the cancellation token already looks canceled, so the operation exits immediately and the loop runs again, all without ever leaving the Cancel() thread. We end up looping indefinitely: nothing can change until the Cancel() finishes, and the Cancel() can't finish because the background loop has stolen its thread (see the sketch below).
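
To illustrate the general hazard outside the library (a standalone sketch, not DistributedLock's actual keepalive code): a callback registered on a CancellationToken runs synchronously on the thread that calls Cancel(), so if that callback spins in a loop that can only exit once cancellation has "finished", Cancel() never returns.

using System;
using System.Threading;

class CancelThreadStarvationSketch
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        // Stand-in for the keepalive/monitor loop waking up inside the Cancel() call.
        cts.Token.Register(() =>
        {
            // The token already looks canceled while Cancel() is still running, so an
            // operation that checks it returns immediately and this loop spins forever,
            // never yielding the thread back so that Cancel() can complete.
            while (cts.Token.IsCancellationRequested)
            {
                // simulate "start monitoring, observe cancellation, loop again" with no await
            }
        });

        Console.WriteLine("Calling Cancel()...");
        cts.Cancel(); // never returns: the registered callback has taken over this thread
        Console.WriteLine("unreachable");
    }
}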

madelson added a commit that referenced this issue Apr 24, 2021
Make all SqlServer tests use the default keepalive behavior by default.