Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed transaction fixes #1610

Merged
merged 3 commits into from Jun 19, 2017
Merged

Distributed transaction fixes #1610

merged 3 commits into from Jun 19, 2017

Conversation

roji
Copy link
Member

@roji roji commented Jun 19, 2017

@fredericDelaporte, this PR contains fixes for #1571 and #1580. Can I ask you to take a look and confirm that things look good for you?

I would ideally like to see your race condition tests passing after these commits, but that currently is possible because of #1609 (comment).

Let me know what you think.

roji added 2 commits June 19, 2017 10:54
* If the user connection is still open when performing 2PC, perform the
  operation over a new connection to avoid concurrent connection use
  do to the asynchronous nature of the MSDTC.
* Clear the connection's EnlistedTransaction at the end of the 1st phase
  to allow the connection to reenlist with another transaction, rather
  than waiting for the end of the 2nd phase (which again is async).

Fixes #1571
Copy link
Contributor

@fredericDelaporte fredericDelaporte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit case looks good to me. But I expect the rollback case to still have issues. I will add the required tests in my PR.

@@ -506,11 +506,13 @@ public override void EnlistTransaction(Transaction transaction)
if (transaction == null)
throw new ArgumentNullException(nameof(transaction));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still forbids #1580.

#1580 may still be required for re-using a connection without a transaction after a rollback: a rollback may not go through prepare phase, but directly to the second phase rollback. This will cause the connection to be still enlisted with the transaction. #1580 would allow the user to manually force un-enlistment.
(It is maybe unneeded for Npgsql if connection checks the transaction status and does not use it if no more active, but it would then have to guard against disposed transaction no more allowing to check their status.)

But well, I think another trouble will then happen: in this case, the rollback phase still need to use the connection while executing asynchronously, so we still have a concurrency issue there.

if (EnlistedTransaction != null)
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
{
if (transaction.Equals(EnlistedTransaction))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better check EnlistedTransaction.Equals(transaction), if removing the null check on transaction.

return;
if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active)
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})");
}

var connector = CheckReadyAndGetConnector();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code below there is

EnlistedTransaction = transaction;
transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);

If supporting #1580 (null transaction), an early exit in case of null should be added between those two lines.

@roji
Copy link
Member Author

roji commented Jun 19, 2017

@fredericDelaporte thanks for your comments, have amended the PR with your suggested changes - can you take another quick look?

@fredericDelaporte
Copy link
Contributor

fredericDelaporte commented Jun 19, 2017

Should fully fix #1580. I expect #1571 to still fail in the rollback case, even by amending the newly added test ConnectionReuseRaceAfterRollback in #1609 with a null enlistment before the connection re-use.

If that gets confirmed (should I merge your PR in mine for checking?), I think a synchronization mechanism will be mandatory.

For avoiding it to be too pervasive, it could be put only in the transaction enlistment method: if an inactive transaction still lies there, wait for it to be removed by its volatile resource manager. Npgsql documentation would then mandates to explicitly enlist with the next transaction or null if wishing to re-use the connection right after a distributed rollback.

@roji
Copy link
Member Author

roji commented Jun 19, 2017

@fredericDelaporte the rollback failure you're referring to would be due to Prepare phase step being skipped, and going to rollback directly, right? If so, I'd imagine that in that case there's a good chance rollback would be called synchronously and not asynchronously (at least it makes sense to me), the only real way to find out is to test...

Or if you're talking about some other scenario can you please elaborate?

Either way feel free to try with my PR merged.

@fredericDelaporte
Copy link
Contributor

the rollback failure you're referring to would be due to Prepare phase step being skipped, and going to rollback directly, right?

Precisely.

there's a good chance rollback would be called synchronously and not asynchronously

Unfortunately that is not the case, it is run asynchronously when the transaction is distributed.

Here is the output of queued event in the rollback test, without your PR:

Failed at iteration 2.
Events:
Scope started, connection enlisted (TId 18)
Durable resource 2: enlisted (TId 18)
Insert done (TId 18)
Scope not completed (TId 18)
Scope disposed (TId 18)
Durable resource 2: rollback phase start (TId 20)
Durable resource 2: transaction completed start (TId 21)
Durable resource 2: rollback phase, calling done (TId 20)
Durable resource 2: transaction completed middle (TId 21)
Durable resource 2: transaction completed end (TId 21)
Durable resource 2: rollback phase end (TId 20)
-------------
Start formatting event queue, going to sleep a bit for late events
------------- (TId 18)

Exception NUnit.Framework.AssertionException:   Expected: No Exception to be thrown
  But was:  <Npgsql.NpgsqlOperationInProgressException: The connection is already in state 'Executing'
   à Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState, NpgsqlCommand command)
   à Npgsql.NpgsqlConnector.StartUserAction(NpgsqlCommand command)
   à Npgsql.NpgsqlCommand.<ExecuteScalar>d__87.MoveNext()
--- Fin de la trace de la pile à partir de l'emplacement précédent au niveau duquel l'exception a été levée ---
   à System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   à System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   à System.Runtime.CompilerServices.ValueTaskAwaiter`1.GetResult()
   à Npgsql.NpgsqlCommand.ExecuteScalar()
   à Npgsql.Tests.NpgsqlConnectionExtensions.ExecuteScalar(NpgsqlConnection conn, String sql, NpgsqlTransaction tx) dans E:\Projets\npgsql\test\Npgsql.Tests\TestUtil.cs:ligne 89
   à Npgsql.Tests.SystemTransactionTests.<>c__DisplayClass12_0.<ConnectionReuseRaceAfterRollback>b__0() dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 408
   à NUnit.Framework.Constraints.ThrowsConstraint.VoidInvocationDescriptor.Invoke()
   à NUnit.Framework.Constraints.ThrowsConstraint.ExceptionInterceptor.Intercept(Object invocation)>

   à NUnit.Framework.Assert.ReportFailure(String message)
   à NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
   à NUnit.Framework.Assert.DoesNotThrow(TestDelegate code)
   à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceAfterRollback(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 408
à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceAfterRollback(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 413

@fredericDelaporte
Copy link
Contributor

I have just locally rebased and tested: the two other races do no more fail. But the rollback race still fail, with same output.

@fredericDelaporte
Copy link
Contributor

And with null enlistment, still same:

Failed at iteration 2.
Events:
Scope started, connection enlisted (TId 18)
Durable resource 2: enlisted (TId 18)
Insert done (TId 18)
Scope not completed (TId 18)
Scope disposed (TId 18)
Durable resource 2: rollback phase start (TId 23)
Connection enlisted with null (TId 18)
Durable resource 2: rollback phase, calling done (TId 23)
Durable resource 2: transaction completed start (TId 21)
Durable resource 2: rollback phase end (TId 23)
Durable resource 2: transaction completed middle (TId 21)
Durable resource 2: transaction completed end (TId 21)
-------------
Start formatting event queue, going to sleep a bit for late events
------------- (TId 18)

Exception NUnit.Framework.AssertionException:   Expected: No Exception to be thrown
  But was:  <Npgsql.NpgsqlOperationInProgressException: The connection is already in state 'Executing'
   à Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState, NpgsqlCommand command)
   à Npgsql.NpgsqlConnector.StartUserAction(NpgsqlCommand command)
   à Npgsql.NpgsqlCommand.<ExecuteScalar>d__87.MoveNext()
--- Fin de la trace de la pile à partir de l'emplacement précédent au niveau duquel l'exception a été levée ---
   à System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   à System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   à System.Runtime.CompilerServices.ValueTaskAwaiter`1.GetResult()
   à Npgsql.NpgsqlCommand.ExecuteScalar()
   à Npgsql.Tests.NpgsqlConnectionExtensions.ExecuteScalar(NpgsqlConnection conn, String sql, NpgsqlTransaction tx) dans E:\Projets\npgsql\test\Npgsql.Tests\TestUtil.cs:ligne 89
   à Npgsql.Tests.SystemTransactionTests.<>c__DisplayClass12_0.<ConnectionReuseRaceAfterRollback>b__0() dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 409
   à NUnit.Framework.Constraints.ThrowsConstraint.VoidInvocationDescriptor.Invoke()
   à NUnit.Framework.Constraints.ThrowsConstraint.ExceptionInterceptor.Intercept(Object invocation)>

   à NUnit.Framework.Assert.ReportFailure(String message)
   à NUnit.Framework.Assert.That[TActual](TActual actual, IResolveConstraint expression, String message, Object[] args)
   à NUnit.Framework.Assert.DoesNotThrow(TestDelegate code)
   à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceAfterRollback(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 409
à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceAfterRollback(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 414

Do you rather keep #1609 distinct, or should I force push my rebasing?

@roji
Copy link
Member Author

roji commented Jun 19, 2017

OK, I merged this PR (#1610) to dev and also backported to 3.2.5. If you're done with #1609, rebase it on the latest dev and I'll merge that.

But I guess we're stuck with the rollback scenario where prepare is skipped - since Prepare didn't occur the user connection must be used to roll back. Ironically, RollbackLocal already contains a flawed attempt to handle [timer-based rollback]((https://msdn.microsoft.com/en-us/library/ms149852(v=vs.110).aspx), where the TransactionScope itself triggers a rollback after some time has elapsed, and that can also happen concurrently with user actions. If RollbackLocal gets a concurrent connection use exception while trying to roll back, it waits and tries again later - but of course it's possible that the user action will trigger the exception instead, which is what you seem to be seeing...

There doesn't seem to be any possible solution here against internal Npgsql synchronization. I've opened #1611 to track this - it probably doesn't make sense to include something like this in a patch-level release so it'll be for 3.3. Let me know if anything is inaccurate or if you think of something we've missed.

(thanks again for all the help!)

@fredericDelaporte
Copy link
Contributor

Done with #1609, rebased on dev. The race tests stays explicit, maybe do you wish them to be included in build test. Of course the rollback race still fails, so this one must stay explicit for now.

@roji
Copy link
Member Author

roji commented Jun 19, 2017

@fredericDelaporte good idea - the race tests don't take that much time (a few seconds), can you please remove the explicit from them (except the rollback)?

One last thought about the problematic rollback scenario... It's true that we can avoid the concurrent connection issue via a new synchronization mechanism (#1611). However, that would still leave the enlistment clearing problem - if a user attempts to enlist the same connection to a new TransactionScope, there will still be a race condition as we can't know when exactly the EnlistedTransaction will be cleared.

Another, even more subtle race condition probably exists, by the way, if the user closes the connection right after disposing the TransactionScope - TransactionScope.Dispose and NpgsqlConnection.Close could run in parallel. I haven't really dived into what possible bad consequences that could have, but it can't be good.

Amazing how much mayhem a bit of asynchronicity can produce...

@roji
Copy link
Member Author

roji commented Jun 19, 2017

Forget the above comment about the rollback - the additional locking in EnlistTransaction which you proposed could make EnlistTransaction simply block until the previous RM is disposed.

@fredericDelaporte
Copy link
Contributor

Explicit removed. Unfortunately I have just got an error with ConnectionReuseRaceChainingTransaction I am not able to reproduce. The console output was:

Failed at iteration 74.
Events:
First scope started (TId 18)
-------------
Start formatting event queue, going to sleep a bit for late events
------------- (TId 18)

Exception Npgsql.NpgsqlOperationInProgressException: The connection is already in state 'Executing'
   à Npgsql.NpgsqlConnector.StartUserAction(ConnectorState newState, NpgsqlCommand command)
   à Npgsql.NpgsqlConnection.BeginTransaction(IsolationLevel level)
   à Npgsql.VolatileResourceManager..ctor(NpgsqlConnection connection, Transaction transaction)
   à Npgsql.NpgsqlConnection.EnlistTransaction(Transaction transaction)
   à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceChainingTransaction(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 437
à Npgsql.Tests.SystemTransactionTests.ConnectionReuseRaceChainingTransaction(Boolean distributed) dans E:\Projets\npgsql\test\Npgsql.Tests\SystemTransactionTests.cs:ligne 487

This was a failure at the very beginning of the iteration, at the first transaction enlistment, right after having got the connection. As if the pool had yield a connection which was still in use.

About the Close concurrency, maybe could it be solved by starting the close by an EnlistTransaction(null) and the locking in case of previous RM still there.

For the locking, I think the connection should store a reference on the RM. The RM should expose a Wait method which would be called with a ?. (it is thread safe) before enlisting with a new transaction or null, after having checked pre-conditions (current transaction no more active). That wait would use preferably a SemaphoreSlim for locking till disposal. Disposal would clear the RM reference from the connection and release the semaphore.

Now for addressing the last rollback trouble, the RM disposal should no more set to null the transaction in case of timeout rollback (better still do it for other cases, but not sure if you can recognize timeout from other cases though). That would be up to the user to change enlistment after a rollback. And better wrap the previous transaction status check in EnlistTransaction with a catch(DisposedObjectException), which would then consider the previous transaction as no more active.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants