Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Microsoft SQL Server Resource Adaptor(JDBC Driver) returning XAException with error code XAER_RMERR in case of fail-over of SQL Server AG from primary node to standby node while committing an XA Transaction #1432

Closed
pa1484 opened this issue Sep 17, 2020 · 13 comments
Projects

Comments

@pa1484
Copy link

pa1484 commented Sep 17, 2020

Driver version

SQL Server JDBC Driver 7.0 & 8.2

SQL Server version

SQL Server AG 2017

Client Operating System

Windows Server 2016, version 10.0

JAVA/JVM version

IBM J9 VM

Table schema

Problem description

During a fail-over test for one of the product when using SQL server AG as back end, we found that there are few XA transactions in SQL Server DTC are in in-doubt state.

After analyzing the trace files, we found that the SQL Server JDBC driver was returning an XAException with an error code of XAER_RMERR on the XAResource.commit due to SQL Server AG fail-over. Websphere application server treats XAER_RMERR as a non-retriable condition as per the XA specification.

We tested with a diagnostic patch on Websphere application server to treat XAER_RMERR as a XAER_RMFAIL(which is retriable) and recreated the scenario.Due to the diagnostic patch Websphere retries the commits which are eventually successful and we no longer finding in-doubt transactions in SQL server DTC.

  1. Expected behaviour:
    SQL Server Resource Adaptor(JDBC Driver) should return an XAException with a retriable error code such as XAER_RMFAIL in case fail-over of SQL Server AG from primary node to standby node while committing an XA Transaction.
  2. Actual behaviour:
    SQL Server Resource Adaptor(JDBC Driver) returning XAException with error code XAER_RMERR in case fail-over of SQL Server AG from primary node to standby node while committing an XA Transaction
  3. Error message/stack trace:
    Below is the trace that shows XAException with XAER_RMERR error code returned by SQLServer JDBC driver during fail-over.

[7/30/20 17:20:37:254 CST] 00000369 WSRdbXaResour > commit Entry
com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl@5e889f7e
WSRdbManagedConnectionImpl@3a76217e
com.ibm.ws.tx.jta.XidImpl@e2696786
{XidImpl: formatId(57415344), gtrid_length(36), bqual_length(54),
data(000001739f05fd25000000017aa729486c90e9918373568c6d645a1d91c7913943fe1a86000001739f05fd25000000017aa729486c90e9918373568c6d645a1d91c7913943fe1a86000000010000000000000000000000000001)}
Global Transaction ID: 00 00 01 73 9f 05 fd 25 00 00 00 01 7a a7 29 48 6c 90 e9 91 83 73 56 8c 6d 64 5a 1d 91 c7 91 39 43 fe 1a 86
Branch Qualifier: 00 00 01 73 9f 05 fd 25 00 00 00 01 7a a7 29 48 6c 90 e9 91 83 73 56 8c 6d 64 5a 1d 91 c7 91 39 43 fe 1a 86 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01
Format ID: 1463898948
TWO PHASE

[7/30/20 17:20:37:286 CST] 00000369 WSRdbXaResour E DSRA0302E: XAException occurred. Error code is: XAER_RMERR (-3). Exception is: com.microsoft.sqlserver.jdbc.SQLServerException: SQL Server did not return a response. The connection has been closed.
[7/30/20 17:20:37:286 CST] 00000369 WSRdbXaResour 3 XAException was not a connection error for XAResource com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl@5e889f7e
[7/30/20 17:20:37:286 CST] 00000369 WSRdbXaResour < commit Exit
javax.transaction.xa.XAException: com.microsoft.sqlserver.jdbc.SQLServerException: SQL Server did not return a response. The connection has been closed.
at com.microsoft.sqlserver.jdbc.SQLServerXAResource.DTC_XA_Interface(SQLServerXAResource.java:738)
at com.microsoft.sqlserver.jdbc.SQLServerXAResource.commit(SQLServerXAResource.java:807)
at com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl.commit(WSRdbXaResourceImpl.java:361)
at com.ibm.ejs.j2c.XATransactionWrapper.commit(XATransactionWrapper.java:496)
at com.ibm.tx.jta.impl.JTAXAResourceImpl.commit(JTAXAResourceImpl.java:308)
at com.ibm.tx.jta.impl.RegisteredResources.deliverOutcome(RegisteredResources.java:1644)
at com.ibm.tx.jta.impl.RegisteredResources.distributeOutcome(RegisteredResources.java:2003)
at com.ibm.tx.jta.impl.RegisteredResources.distributeCommit(RegisteredResources.java:2315)
at com.ibm.tx.jta.impl.TransactionImpl.internalCommit(TransactionImpl.java:1879)
at com.ibm.tx.jta.impl.TransactionImpl.internalCommit(TransactionImpl.java:1855)
at com.ibm.tx.jta.impl.TransactionImpl.coreStage2CommitProcessing(TransactionImpl.java:1135)
at com.ibm.tx.jta.impl.TransactionImpl.stage2CommitProcessing(TransactionImpl.java:1173)
at com.ibm.tx.jta.impl.TransactionImpl.processCommit(TransactionImpl.java:1034)
at com.ibm.tx.jta.impl.TransactionImpl.commit(TransactionImpl.java:964)
at com.ibm.ws.tx.jta.TranManagerImpl.commit(TranManagerImpl.java:439)
at com.ibm.tx.jta.impl.TranManagerSet.commit(TranManagerSet.java:191)
at com.ibm.ws.uow.UOWManagerImpl.uowCommit(UOWManagerImpl.java:807)
.....
.....
.....

Below is the trace where XAER_RMERR error code is converted to XAER_RMFAIL error code, later retried and committed successfully on a different thread.

[8/3/20 13:49:45:363 CST] 0000026e WSRdbXaResour > commit Entry
com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl@1909f879
WSRdbManagedConnectionImpl@85bb6fef
com.ibm.ws.tx.jta.XidImpl@aad4067c
{XidImpl: formatId(57415344), gtrid_length(36), bqual_length(54),
data(00000173b2de55fe000000014b323c4c6b4650ebd3cd42e87a42c9c60b9b3d3d64a9b46a00000173b2de55fe000000014b323c4c6b4650ebd3cd42e87a42c9c60b9b3d3d64a9b46a000000010000000000000000000000000001)}
Global Transaction ID: 00 00 01 73 b2 de 55 fe 00 00 00 01 4b 32 3c 4c 6b 46 50 eb d3 cd 42 e8 7a 42 c9 c6 0b 9b 3d 3d 64 a9 b4 6a
Branch Qualifier: 00 00 01 73 b2 de 55 fe 00 00 00 01 4b 32 3c 4c 6b 46 50 eb d3 cd 42 e8 7a 42 c9 c6 0b 9b 3d 3d 64 a9 b4 6a 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01
Format ID: 1463898948
TWO PHASE
[8/3/20 13:49:45:363 CST] 0000026e WSRdbXaResour W Converting XAER_RMERR to XAER_RMFAIL

[8/3/20 13:49:45:379 CST] 0000026e FfdcProvider W com.ibm.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I: FFDC Incident emitted on C:\IBM\BAW\profiles\Node1Profile\logs\ffdc\AppClusterMember1_ad159ec9_20.08.03_13.49.45.3635376806510343302270.txt com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl.commit 126

[8/3/20 13:49:45:379 CST] 0000026e WSRdbXaResour E DSRA0302E: XAException occurred. Error code is: XAER_RMFAIL (-7). Exception is: com.microsoft.sqlserver.jdbc.SQLServerException: SQL Server did not return a response. The connection has been closed.
.....
.....
[8/3/20 13:49:45:520 CST] 0000026e WSRdbXaResour < commit Exit
javax.transaction.xa.XAException
at com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl.commit(WSRdbXaResourceImpl.java:386)
at com.ibm.ejs.j2c.XATransactionWrapper.commit(XATransactionWrapper.java:496)
at com.ibm.tx.jta.impl.JTAXAResourceImpl.commit(JTAXAResourceImpl.java:308)
at com.ibm.tx.jta.impl.RegisteredResources.deliverOutcome(RegisteredResources.java:1644)
at com.ibm.tx.jta.impl.RegisteredResources.distributeOutcome(RegisteredResources.java:2003)
at com.ibm.tx.jta.impl.RegisteredResources.distributeCommit(RegisteredResources.java:2315)
at com.ibm.tx.jta.impl.TransactionImpl.internalCommit(TransactionImpl.java:1879)
at com.ibm.tx.jta.impl.TransactionImpl.internalCommit(TransactionImpl.java:1855)
at com.ibm.tx.jta.impl.TransactionImpl.coreStage2CommitProcessing(TransactionImpl.java:1135)
at com.ibm.tx.jta.impl.TransactionImpl.stage2CommitProcessing(TransactionImpl.java:1173)
at com.ibm.tx.jta.impl.TransactionImpl.processCommit(TransactionImpl.java:1034)
at com.ibm.tx.jta.impl.TransactionImpl.commit(TransactionImpl.java:964)
at com.ibm.ws.tx.jta.TranManagerImpl.commit(TranManagerImpl.java:439)
at com.ibm.tx.jta.impl.TranManagerSet.commit(TranManagerSet.java:191)
at com.ibm.ws.uow.UOWManagerImpl.uowCommit(UOWManagerImpl.java:807)
....
....
....

[8/3/20 13:50:55:489 CST] 00000088 WSRdbXaResour > commit Entry
com.ibm.ws.rsadapter.spi.WSRdbXaResourceImpl@e862f1d8
WSRdbManagedConnectionImpl@7ab88f11
com.ibm.ws.tx.jta.XidImpl@aad4067c
{XidImpl: formatId(57415344), gtrid_length(36), bqual_length(54),
data(00000173b2de55fe000000014b323c4c6b4650ebd3cd42e87a42c9c60b9b3d3d64a9b46a00000173b2de55fe000000014b323c4c6b4650ebd3cd42e87a42c9c60b9b3d3d64a9b46a000000010000000000000000000000000001)}
Global Transaction ID: 00 00 01 73 b2 de 55 fe 00 00 00 01 4b 32 3c 4c 6b 46 50 eb d3 cd 42 e8 7a 42 c9 c6 0b 9b 3d 3d 64 a9 b4 6a
Branch Qualifier: 00 00 01 73 b2 de 55 fe 00 00 00 01 4b 32 3c 4c 6b 46 50 eb d3 cd 42 e8 7a 42 c9 c6 0b 9b 3d 3d 64 a9 b4 6a 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01
Format ID: 1463898948
TWO PHASE
[8/3/20 13:50:55:489 CST] 00000088 WSRdbXaResour 3 XAResource.start was never issued; allowing commit for recovery.
[8/3/20 13:50:55:505 CST] 00000088 WSRdbXaResour < commit Exit

  1. Any other details that can be helpful:
    AG nodes are configured 'sync-commit'
    in-doubt xact resolution set to 0

JDBC trace logs

Reproduction code

@lilgreenbird
Copy link
Member

hi @pa1484

I could find anything in the XA specification that says XAER_RMERR is non-retriable?
On p40 for xa_forget it defines the return values:
[XAER_RMERR]
An error occurred in the resource manager and the resource manager has not
forgotten the transaction branch.
[XAER_RMFAIL]
An error occurred that makes the resource manager unavailable.

Similarly for javax.transaction.xa.XAException:
static int | XAER_RMERR | A resource manager error has occurred in the transaction branch.
static int | Resource manager is unavailable.

Seems both these are valid and it's up to the application to interpret what actions to take for each return code. It's great that you could patch your application to handle the XAER_RMERR code since changing the return code in the driver would be a breaking change for those applications which may be relying on the current XAER_RMERR return value.

@pa1484
Copy link
Author

pa1484 commented Sep 21, 2020

Hi,

In the reference specification page 34, the semantics for the different responses to xa_commit are listed.
For XAER_RMERR, it states:

[XAER_RMERR]
An error occurred in committing the work performed on behalf of the transaction
branch and the branch’s work has been rolled back. Note that returning this error
signals a catastrophic event to a transaction manager since other resource
managers may successfully commit their work on behalf of this branch. This error
should be returned only when a resource manager concludes that it can never
commit the branch and that it cannot hold the branch’s resources in a prepared
state. Otherwise, [XA_RETRY] should be returned.

Note that in this case the RM has rolled back the transaction after concluding that it can never be committed. From the transaction manager perspective it is not retriable.7
The issue is on xa_commit and not xa_forget.

Thanks,
Pavan Sikhakolli

@lilgreenbird
Copy link
Member

oh I see. I was going by the line numbers from your log above which is xa_forget in current version. You're right in previous version that was the line of xa_commit. Yes from the description in the specification it does look like RMFAIL would be a more appropriate return code in this scenario.

@lilgreenbird
Copy link
Member

Hi @pa1484

I just checked actually the error code did not originate from the driver. Distributed transactions is handled by the SQL Server which uses the Microsoft Distributed Transaction Coordinator (MDTC). The driver uses the msqlsrvxa1 library interface to communicate to the DTC and merely returns the result from the call. I could not find anything in their documentation what error should be returned in the case of an AG failover scenario but in this case the server returned RMERR which the driver just passed along. The driver does not do any interpretation of the result.

@frowe
Copy link

frowe commented Sep 23, 2020

So, how do we get MDTC to fix this bug? Do we need to raise a new issue in another repo?

@lilgreenbird
Copy link
Member

I'm not sure if this is a bug? As per your msg above the spec says it should return RMERR if "An error occurred in committing the work performed on behalf of the transaction branch and the branch’s work has been rolled back. Note that returning this error signals a catastrophic event to a transaction manager since other resource managers may successfully commit their work on behalf of this branch. This error should be returned only when a resource manager concludes that it can never commit the branch and that it cannot hold the branch’s resources in a prepared state. Otherwise, [XA_RETRY] should be returned." Or an RMFAIL if "An error occurred that make the resource manager unavailable." Seems up to interpretation what kind of an error a failover is looks MDTC chose the former.

@pa1484
Copy link
Author

pa1484 commented Sep 25, 2020

Hi,

I tried assuming XAER_RMERR as a XAER_RMFAIL which is retriable so, WebSphere retried committing the transaction branch and the resource manager successfully committed the transaction. The spec says, an XAER_RMERR error code should be returned only when a resource manager concludes that it can never commit the transaction branch. But in this case upon retry resource manager could be able to commit the transaction branch. So, I believe this is a bug in either SQL server JDBC driver or MDTC which is returning wrong error code to transaction manager. Please let us know if we need raise a new issue in another repo.

@lilgreenbird
Copy link
Member

hi @pa1484

you will have to go through MS support, who will file an RFC with the appropriate team.

@peterbae peterbae added this to Waiting for Customer in MSSQL JDBC Oct 2, 2020
@peterbae
Copy link
Contributor

Let us know if anyone still needs help from the JDBC team, otherwise I will close this issue in due time.

MSSQL JDBC automation moved this from Waiting for Customer to Closed Issues Nov 10, 2020
@jmstephensgit
Copy link

Does anyone know if this was fixed?

@Jeffery-Wasty
Copy link
Member

This is not reproducible on our end. You will have to go through MS support as was recommended above, to get the most up-to-date information.

@David-Engel
Copy link
Contributor

Looking at the JDBC code, this looks like it is coming from the driver due to the connection being dropped and the socket write failing. The error is not from the server. The error is because the socket write failed when we tried to call the XA function on the server. We should be able to detect this and return RMFAIL instead of RMERR when we can't even make the call as this is different from an actual error sent from the server.

@David-Engel David-Engel reopened this Jan 28, 2023
MSSQL JDBC automation moved this from Closed Issues to Under Investigation Jan 28, 2023
@frowe
Copy link

frowe commented Jan 30, 2023

Thanks for re-examining this, we have a mutual customer that is trying to get AlwaysOn setup and this issue is holding up their progress. If we can do anything to help, let us know

@tkyc tkyc closed this as completed Mar 2, 2023
MSSQL JDBC automation moved this from Under Investigation to Closed Issues Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
MSSQL JDBC
  
Closed Issues
Development

No branches or pull requests

8 participants