New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributed transaction fixes #1610
Conversation
* If the user connection is still open when performing 2PC, perform the operation over a new connection to avoid concurrent connection use do to the asynchronous nature of the MSDTC. * Clear the connection's EnlistedTransaction at the end of the 1st phase to allow the connection to reenlist with another transaction, rather than waiting for the end of the 2nd phase (which again is async). Fixes #1571
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit case looks good to me. But I expect the rollback case to still have issues. I will add the required tests in my PR.
src/Npgsql/NpgsqlConnection.cs
Outdated
@@ -506,11 +506,13 @@ public override void EnlistTransaction(Transaction transaction) | |||
if (transaction == null) | |||
throw new ArgumentNullException(nameof(transaction)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still forbids #1580.
#1580 may still be required for re-using a connection without a transaction after a rollback: a rollback may not go through prepare phase, but directly to the second phase rollback. This will cause the connection to be still enlisted with the transaction. #1580 would allow the user to manually force un-enlistment.
(It is maybe unneeded for Npgsql if connection checks the transaction status and does not use it if no more active, but it would then have to guard against disposed transaction no more allowing to check their status.)
But well, I think another trouble will then happen: in this case, the rollback phase still need to use the connection while executing asynchronously, so we still have a concurrency issue there.
src/Npgsql/NpgsqlConnection.cs
Outdated
if (EnlistedTransaction != null) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})"); | ||
{ | ||
if (transaction.Equals(EnlistedTransaction)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better check EnlistedTransaction.Equals(transaction)
, if removing the null check on transaction
.
src/Npgsql/NpgsqlConnection.cs
Outdated
return; | ||
if (EnlistedTransaction.TransactionInformation.Status == System.Transactions.TransactionStatus.Active) | ||
throw new InvalidOperationException($"Already enlisted to transaction (localid={EnlistedTransaction.TransactionInformation.LocalIdentifier})"); | ||
} | ||
|
||
var connector = CheckReadyAndGetConnector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code below there is
EnlistedTransaction = transaction;
transaction.EnlistVolatile(new VolatileResourceManager(this, transaction), EnlistmentOptions.None);
If supporting #1580 (null transaction), an early exit in case of null should be added between those two lines.
11f9be4
to
11e5e8f
Compare
@fredericDelaporte thanks for your comments, have amended the PR with your suggested changes - can you take another quick look? |
Should fully fix #1580. I expect #1571 to still fail in the rollback case, even by amending the newly added test If that gets confirmed (should I merge your PR in mine for checking?), I think a synchronization mechanism will be mandatory. For avoiding it to be too pervasive, it could be put only in the transaction enlistment method: if an inactive transaction still lies there, wait for it to be removed by its volatile resource manager. Npgsql documentation would then mandates to explicitly enlist with the next transaction or null if wishing to re-use the connection right after a distributed rollback. |
@fredericDelaporte the rollback failure you're referring to would be due to Prepare phase step being skipped, and going to rollback directly, right? If so, I'd imagine that in that case there's a good chance rollback would be called synchronously and not asynchronously (at least it makes sense to me), the only real way to find out is to test... Or if you're talking about some other scenario can you please elaborate? Either way feel free to try with my PR merged. |
Precisely.
Unfortunately that is not the case, it is run asynchronously when the transaction is distributed. Here is the output of queued event in the rollback test, without your PR:
|
I have just locally rebased and tested: the two other races do no more fail. But the rollback race still fail, with same output. |
And with null enlistment, still same:
Do you rather keep #1609 distinct, or should I force push my rebasing? |
OK, I merged this PR (#1610) to dev and also backported to 3.2.5. If you're done with #1609, rebase it on the latest dev and I'll merge that. But I guess we're stuck with the rollback scenario where prepare is skipped - since Prepare didn't occur the user connection must be used to roll back. Ironically, RollbackLocal already contains a flawed attempt to handle [timer-based rollback]((https://msdn.microsoft.com/en-us/library/ms149852(v=vs.110).aspx), where the TransactionScope itself triggers a rollback after some time has elapsed, and that can also happen concurrently with user actions. If RollbackLocal gets a concurrent connection use exception while trying to roll back, it waits and tries again later - but of course it's possible that the user action will trigger the exception instead, which is what you seem to be seeing... There doesn't seem to be any possible solution here against internal Npgsql synchronization. I've opened #1611 to track this - it probably doesn't make sense to include something like this in a patch-level release so it'll be for 3.3. Let me know if anything is inaccurate or if you think of something we've missed. (thanks again for all the help!) |
Done with #1609, rebased on dev. The race tests stays explicit, maybe do you wish them to be included in build test. Of course the rollback race still fails, so this one must stay explicit for now. |
@fredericDelaporte good idea - the race tests don't take that much time (a few seconds), can you please remove the explicit from them (except the rollback)?
Another, even more subtle race condition probably exists, by the way, if the user closes the connection right after disposing the TransactionScope - TransactionScope.Dispose and NpgsqlConnection.Close could run in parallel. I haven't really dived into what possible bad consequences that could have, but it can't be good. Amazing how much mayhem a bit of asynchronicity can produce... |
Forget the above comment about the rollback - the additional locking in EnlistTransaction which you proposed could make EnlistTransaction simply block until the previous RM is disposed. |
Explicit removed. Unfortunately I have just got an error with
This was a failure at the very beginning of the iteration, at the first transaction enlistment, right after having got the connection. As if the pool had yield a connection which was still in use. About the For the locking, I think the connection should store a reference on the RM. The RM should expose a Now for addressing the last rollback trouble, the RM disposal should no more set to |
@fredericDelaporte, this PR contains fixes for #1571 and #1580. Can I ask you to take a look and confirm that things look good for you?
I would ideally like to see your race condition tests passing after these commits, but that currently is possible because of #1609 (comment).
Let me know what you think.