New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronization for avoiding concurrent connection use with TransactionScope #1611

Open
roji opened this Issue Jun 19, 2017 · 8 comments

Comments

Projects
None yet
2 participants
@roji
Member

roji commented Jun 19, 2017

We need to address to situations:

  1. TransactionScope can trigger a timer-based rollback, which can happen at any point and coincide with a user action on the same connection. There's a flawed attempt to handle this in VolatileResourceManager.RollbackLocal() - but if the rollback happens to start first and the user action is the one that triggers the exception, it fails.
  2. It is possible for the MSDTC to perform a 2-phase rollback while skipping the Prepare phase. Since MSDTC performs the 2nd phase asynchronously, this rollback can happen at any time after the TransactionScope is disposed. See #1571 and #1610.

The only solution to these appears to be an internal synchronization mechanism which would make the user action hang until the rollback completes (not the other way around).

/cc @fredericDelaporte

@roji roji added the enhancement label Jun 19, 2017

@roji roji added this to the 3.3 milestone Jun 19, 2017

@roji roji self-assigned this Jun 19, 2017

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jun 19, 2017

Contributor
  • '2. does not looks hard to synchronize. Just locking the EnlistTransaction could be enough, if we mandates users to explicitly enlist with null if they wish to use a connection outside of a transaction after a rollback.

  • '1. would likely require to lock the whole connection interface. But is it really required? After all, the user is already in a failure case. The TransactionScope.Dispose is supposed to throw a transaction aborted exception with an inner "Transaction timeout". The user is normally still inside the scope, so failing here is not much an issue, that failure will normally be swallowed by the transaction scope own failure. So the attempt to fix the concurrency is not so flawed, but it maybe could be better. At first failed attempt, it seems to cancel the ongoing work (good IMO, since the transaction is to be aborted) but then sleeps half a second. Most likely the user code will go on first, and may re-use the connection. Ideally the cancel should have a continuation task for rollbacking, ensuring the rollback has maximal chances to get done.
    Now maybe you have another issue: too early un-enlistment from transaction. When that timeout rollback is done, the volatile ressource is disposed, setting the transaction on the connection to null. So you have an usable connection, no more enlisted with the transaction, while the user code is potentially still inside the scope... This could cause the code to perform changes outside of any transaction, while the user will not at all expect this.
    Solving that additional point would require to cease un-enlisting from the volatile ressource, at least on rollback cases. The user would be mandated to do the un-enlistment himself after a rollback-ed scope. (Either thanks to #1580 or by enlisting directly with a new transaction.)
    So inside the scope after a timeout, the user may try to use the connection but with an inactive transaction, and that should fail.

Contributor

fredericDelaporte commented Jun 19, 2017

  • '2. does not looks hard to synchronize. Just locking the EnlistTransaction could be enough, if we mandates users to explicitly enlist with null if they wish to use a connection outside of a transaction after a rollback.

  • '1. would likely require to lock the whole connection interface. But is it really required? After all, the user is already in a failure case. The TransactionScope.Dispose is supposed to throw a transaction aborted exception with an inner "Transaction timeout". The user is normally still inside the scope, so failing here is not much an issue, that failure will normally be swallowed by the transaction scope own failure. So the attempt to fix the concurrency is not so flawed, but it maybe could be better. At first failed attempt, it seems to cancel the ongoing work (good IMO, since the transaction is to be aborted) but then sleeps half a second. Most likely the user code will go on first, and may re-use the connection. Ideally the cancel should have a continuation task for rollbacking, ensuring the rollback has maximal chances to get done.
    Now maybe you have another issue: too early un-enlistment from transaction. When that timeout rollback is done, the volatile ressource is disposed, setting the transaction on the connection to null. So you have an usable connection, no more enlisted with the transaction, while the user code is potentially still inside the scope... This could cause the code to perform changes outside of any transaction, while the user will not at all expect this.
    Solving that additional point would require to cease un-enlisting from the volatile ressource, at least on rollback cases. The user would be mandated to do the un-enlistment himself after a rollback-ed scope. (Either thanks to #1580 or by enlisting directly with a new transaction.)
    So inside the scope after a timeout, the user may try to use the connection but with an inactive transaction, and that should fail.

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jun 19, 2017

Contributor

Confirmed, following fails:

[Test]
public void TimeoutDoesNotAllowToChangeDbInScope()
{
    using (var conn = OpenConnection(ConnectionStringEnlistOff))
    {
        using (new TransactionScope(TransactionScopeOption.Required,
            TimeSpan.FromMilliseconds(10)))
        {
            conn.EnlistTransaction(Transaction.Current);
            Thread.Sleep(110);
            conn.ExecuteNonQuery(@"INSERT INTO data (name) VALUES ('test1')");
        }
    }
    AssertNumberOfRows(0);
}

We do not have an exception from the scope disposal here because we do not attempt to complete it.

Otherwise we should write it:

[Test]
public void TimeoutDoesNotAllowToChangeDbInScope()
{
    Assert.That(
        () =>
        {
            using (var conn = OpenConnection(ConnectionStringEnlistOff))
            {
                using (var scope = new TransactionScope(TransactionScopeOption.Required,
                    TimeSpan.FromMilliseconds(10)))
                {
                    conn.EnlistTransaction(Transaction.Current);
                    Thread.Sleep(110);
                    conn.ExecuteNonQuery(@"INSERT INTO data (name) VALUES ('test1')");
                    scope.Complete();
                }
            }
        },
        Throws.InstanceOf(typeof(TransactionAbortedException))
            .And.InnerException.InstanceOf(typeof(TimeoutException)));
    AssertNumberOfRows(0);
}

Beware, those two versions looks flaky. Sometimes it fails as I am expecting with current Npgsql state, other times it fails for another reason or even does not fail at all... (Usually the first run right after compilation does not fail, but after that one it does fail.)

Contributor

fredericDelaporte commented Jun 19, 2017

Confirmed, following fails:

[Test]
public void TimeoutDoesNotAllowToChangeDbInScope()
{
    using (var conn = OpenConnection(ConnectionStringEnlistOff))
    {
        using (new TransactionScope(TransactionScopeOption.Required,
            TimeSpan.FromMilliseconds(10)))
        {
            conn.EnlistTransaction(Transaction.Current);
            Thread.Sleep(110);
            conn.ExecuteNonQuery(@"INSERT INTO data (name) VALUES ('test1')");
        }
    }
    AssertNumberOfRows(0);
}

We do not have an exception from the scope disposal here because we do not attempt to complete it.

Otherwise we should write it:

[Test]
public void TimeoutDoesNotAllowToChangeDbInScope()
{
    Assert.That(
        () =>
        {
            using (var conn = OpenConnection(ConnectionStringEnlistOff))
            {
                using (var scope = new TransactionScope(TransactionScopeOption.Required,
                    TimeSpan.FromMilliseconds(10)))
                {
                    conn.EnlistTransaction(Transaction.Current);
                    Thread.Sleep(110);
                    conn.ExecuteNonQuery(@"INSERT INTO data (name) VALUES ('test1')");
                    scope.Complete();
                }
            }
        },
        Throws.InstanceOf(typeof(TransactionAbortedException))
            .And.InnerException.InstanceOf(typeof(TimeoutException)));
    AssertNumberOfRows(0);
}

Beware, those two versions looks flaky. Sometimes it fails as I am expecting with current Npgsql state, other times it fails for another reason or even does not fail at all... (Usually the first run right after compilation does not fail, but after that one it does fail.)

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 19, 2017

Member

'2. does not looks hard to synchronize. Just locking the EnlistTransaction could be enough, if we mandates users to explicitly enlist with null if they wish to use a connection outside of a transaction after a rollback.

If I understand you correctly, you're proposing that we add a special lock to NpgsqlConnection which is taken the moment a resource manager is created, and released only once that RM is disposed, right? And then EnlistTransaction would simply block on that lock being released by any previous RM. This would work, but has the disadvantage of requiring a strange EnlistTransaction(null) just to continue using the transaction (i.e. when not starting a new scope). This is odd to say the least.

I'll try to answer your second comment soon (about the timeout).

Member

roji commented Jun 19, 2017

'2. does not looks hard to synchronize. Just locking the EnlistTransaction could be enough, if we mandates users to explicitly enlist with null if they wish to use a connection outside of a transaction after a rollback.

If I understand you correctly, you're proposing that we add a special lock to NpgsqlConnection which is taken the moment a resource manager is created, and released only once that RM is disposed, right? And then EnlistTransaction would simply block on that lock being released by any previous RM. This would work, but has the disadvantage of requiring a strange EnlistTransaction(null) just to continue using the transaction (i.e. when not starting a new scope). This is odd to say the least.

I'll try to answer your second comment soon (about the timeout).

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jun 19, 2017

Contributor

This would work, but has the disadvantage of requiring a strange EnlistTransaction(null) just to continue using the transaction (i.e. when not starting a new scope).

Yes, but only for the rollback case, where the RM will not be able of clearing that lock soon enough. (It can clear it soon enough from first phase or from singlephasecommit.)

Contributor

fredericDelaporte commented Jun 19, 2017

This would work, but has the disadvantage of requiring a strange EnlistTransaction(null) just to continue using the transaction (i.e. when not starting a new scope).

Yes, but only for the rollback case, where the RM will not be able of clearing that lock soon enough. (It can clear it soon enough from first phase or from singlephasecommit.)

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jun 19, 2017

Contributor

I think it should be seen as a reasonable workaround for people exposed to those troubles. After all, they do not look to be that numerous. Those troubles appear when rollbacking truly distributed transaction, or when having long running transaction ending timed-out. The other rollback cases could still work without explicit un-enlistment from transaction, if the RM is able to detect it is rollback-ing in a case without asynchronous issues. Detecting the distributed case is easy, as shown with the new method I have added for that in tests (but the RM would have to store the transaction for checking that distributed status at the time of the rollback). For the timeout case, I have not checked anything.

Contributor

fredericDelaporte commented Jun 19, 2017

I think it should be seen as a reasonable workaround for people exposed to those troubles. After all, they do not look to be that numerous. Those troubles appear when rollbacking truly distributed transaction, or when having long running transaction ending timed-out. The other rollback cases could still work without explicit un-enlistment from transaction, if the RM is able to detect it is rollback-ing in a case without asynchronous issues. Detecting the distributed case is easy, as shown with the new method I have added for that in tests (but the RM would have to store the transaction for checking that distributed status at the time of the rollback). For the timeout case, I have not checked anything.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Jun 20, 2017

Member

@fredericDelaporte looking at this again with today's fresh eyes...

First, for the rollback scenario where prepare is skipped, there indeed seems to be no possible solution apart from provider a special API to the user which would block on a lock until the 2nd phase actually terminates. The only comment I'd have would be to provide a better-named public method such as NpgsqlConnection.WaitForDistributedTransaction(), rather than a more cryptic EnlistTransaction(null).

But here's a more important comment... Since there's at least one case where calling this method is required to avoid race conditions, it makes sense for it to be "mandatory" when using TransactionScope (and where escalation is possible). In other words, the two fixes in #1610 should probably rolled back - if the user omits the call to WaitForDistributedTransaction(), it's actually better for them to hit the error early and often, rather than in the rare edge case of a prepare-skipping rollback (solutions that work 90% of the time are actually more dangerous than no solution at all, at least when a 100% solution does exist). This solution in effect provides us with sync 2nd phase by asking for the user's help...

One note regarding the timeout concurrency issue... The reason I said the current mechanism is flawed, is that it attempts to handle a NpgsqlOperationInProgressException by retrying later, but that of course doesn't take care of the same exception being thrown to user code... It's a matter of pure chance whether the rollback timer fires when a user query is in progress (in which case the current mechanism would work), or whether the timer fires and right after that the user attempts a new query. In the second case the user's code would throw a totally unexpected NpgsqlOperationInProgressException, which isn't good. Regardless, the "early un-enlistment" you mentioned is also an issue.

However, if we end up going with something like WaitForDistributedTransaction() to solve the async 2nd phase issues, this unfortunately doesn't help us with the timer issue, as I originally thought when opening this issue... The only solution I can see for that would, again, be a synchronization mechanism that works for all connection operations...

Let me know what you think.

Member

roji commented Jun 20, 2017

@fredericDelaporte looking at this again with today's fresh eyes...

First, for the rollback scenario where prepare is skipped, there indeed seems to be no possible solution apart from provider a special API to the user which would block on a lock until the 2nd phase actually terminates. The only comment I'd have would be to provide a better-named public method such as NpgsqlConnection.WaitForDistributedTransaction(), rather than a more cryptic EnlistTransaction(null).

But here's a more important comment... Since there's at least one case where calling this method is required to avoid race conditions, it makes sense for it to be "mandatory" when using TransactionScope (and where escalation is possible). In other words, the two fixes in #1610 should probably rolled back - if the user omits the call to WaitForDistributedTransaction(), it's actually better for them to hit the error early and often, rather than in the rare edge case of a prepare-skipping rollback (solutions that work 90% of the time are actually more dangerous than no solution at all, at least when a 100% solution does exist). This solution in effect provides us with sync 2nd phase by asking for the user's help...

One note regarding the timeout concurrency issue... The reason I said the current mechanism is flawed, is that it attempts to handle a NpgsqlOperationInProgressException by retrying later, but that of course doesn't take care of the same exception being thrown to user code... It's a matter of pure chance whether the rollback timer fires when a user query is in progress (in which case the current mechanism would work), or whether the timer fires and right after that the user attempts a new query. In the second case the user's code would throw a totally unexpected NpgsqlOperationInProgressException, which isn't good. Regardless, the "early un-enlistment" you mentioned is also an issue.

However, if we end up going with something like WaitForDistributedTransaction() to solve the async 2nd phase issues, this unfortunately doesn't help us with the timer issue, as I originally thought when opening this issue... The only solution I can see for that would, again, be a synchronization mechanism that works for all connection operations...

Let me know what you think.

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jun 20, 2017

Contributor

First, as currently there are few reports about those races condition, reverting the fix for the usual case will still keep it a low occurring failure, so that will not really help user not forgetting the synchronization call. In normal application use cases, that is probably not very usual to chain short distributed transactions on the same connection, without any processing in between.

Second, from an external library point of view, working only with the DbConnection interface, introducing a specific method is not a solution. NHibernate supports more than ten databases. Its code needs some standard way to work with them.

EnlistTransaction(null) is not that cryptic, although undocumented even in Ado.Net (but nonetheless supported by SqlConnection and OdbcConnection, and required for chaining usage with them too). It is just "leave previous transaction". Yes, for most users, most of times, they will not need it, even without the fixes of #1610. It really should be documented as required for distributed cases, when not enlisting directly another transaction, due to the async nature of distributed transactions. I know, people do not read the docs. Till they get a trouble. Or they report a trouble, and then they can just be pointed to doc. For those who gives up without reading the doc or reporting anything, so-long... The best we can do for those last is maybe to try having better exception message: instead of raising only "connection is in use" message, additionally check if there is still a "inactive" transaction not yet un-enlisted (beware of ObjectDisposedException when accessing tran info), and if yes add some guidance info in the exception message. "A transaction is still ongoing, have you try un-enlisting before re-using the connection after the scope? See http..."

About the timeout, I think I had understood the subject on which you insist, and I believe to have addressed it with my previous answer. Yes the user may get an exception due to the rollback being asynchronously executing, and I have written, that is not a trouble in my opinion. The user is already in a failure case. Even better if Npgsql early fails in its code, instead of silently performing operations outside of any transaction as I demonstrate with above test cases. The fix I propose is even to get the connection systematically failing till the user un-enlist the timed-out transaction. (Granted, I have not written it that explicitly.) And of course, for being consistent, he should un-enlist it only after the failed scope disposal.

Contributor

fredericDelaporte commented Jun 20, 2017

First, as currently there are few reports about those races condition, reverting the fix for the usual case will still keep it a low occurring failure, so that will not really help user not forgetting the synchronization call. In normal application use cases, that is probably not very usual to chain short distributed transactions on the same connection, without any processing in between.

Second, from an external library point of view, working only with the DbConnection interface, introducing a specific method is not a solution. NHibernate supports more than ten databases. Its code needs some standard way to work with them.

EnlistTransaction(null) is not that cryptic, although undocumented even in Ado.Net (but nonetheless supported by SqlConnection and OdbcConnection, and required for chaining usage with them too). It is just "leave previous transaction". Yes, for most users, most of times, they will not need it, even without the fixes of #1610. It really should be documented as required for distributed cases, when not enlisting directly another transaction, due to the async nature of distributed transactions. I know, people do not read the docs. Till they get a trouble. Or they report a trouble, and then they can just be pointed to doc. For those who gives up without reading the doc or reporting anything, so-long... The best we can do for those last is maybe to try having better exception message: instead of raising only "connection is in use" message, additionally check if there is still a "inactive" transaction not yet un-enlisted (beware of ObjectDisposedException when accessing tran info), and if yes add some guidance info in the exception message. "A transaction is still ongoing, have you try un-enlisting before re-using the connection after the scope? See http..."

About the timeout, I think I had understood the subject on which you insist, and I believe to have addressed it with my previous answer. Yes the user may get an exception due to the rollback being asynchronously executing, and I have written, that is not a trouble in my opinion. The user is already in a failure case. Even better if Npgsql early fails in its code, instead of silently performing operations outside of any transaction as I demonstrate with above test cases. The fix I propose is even to get the connection systematically failing till the user un-enlist the timed-out transaction. (Granted, I have not written it that explicitly.) And of course, for being consistent, he should un-enlist it only after the failed scope disposal.

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jun 22, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jun 22, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jun 22, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jun 22, 2017

@fredericDelaporte

This comment has been minimized.

Show comment
Hide comment
@fredericDelaporte

fredericDelaporte Jul 2, 2017

Contributor

About the time-out rollback trouble, for your information, it appears another case could be similar and cause issues: failures server side. Experimenting with NH-3023, a SQL-Server deadlock causes premature transaction end too, before its scope disposal. I have seen that thanks to logs added in transaction completion events.
If the user catches the server failure and try re-use the connection while still in the scope, it should receive an error such as "transaction not in a valid state". SqlConnection does so. I fear that like for the timeout case, Npgsql will not and will issue the user command un-transacted instead. That is a bit less bad than the timeout case, since it requires the user to explicitly catch an exception in a scope, to ignore it and to try still using the connection in the same scope. This is likely not a common pattern. This is even obviously a wrong pattern in case of deadlock, since the transaction is aborted on server side.

Contributor

fredericDelaporte commented Jul 2, 2017

About the time-out rollback trouble, for your information, it appears another case could be similar and cause issues: failures server side. Experimenting with NH-3023, a SQL-Server deadlock causes premature transaction end too, before its scope disposal. I have seen that thanks to logs added in transaction completion events.
If the user catches the server failure and try re-use the connection while still in the scope, it should receive an error such as "transaction not in a valid state". SqlConnection does so. I fear that like for the timeout case, Npgsql will not and will issue the user command un-transacted instead. That is a bit less bad than the timeout case, since it requires the user to explicitly catch an exception in a scope, to ignore it and to try still using the connection in the same scope. This is likely not a common pattern. This is even obviously a wrong pattern in case of deadlock, since the transaction is aborted on server side.

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jul 24, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jul 24, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jul 24, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jul 24, 2017

fredericDelaporte added a commit to fredericDelaporte/npgsql that referenced this issue Jul 24, 2017

@roji roji modified the milestones: 4.0, 4.1 Mar 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment