New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRIVERS-1709: Convert withTransaction tests to unified format #1505
Conversation
1743348
to
d4df26a
Compare
@jyemin: Just tagged you in case you'd like to add these to a Java patch build. Feel free to disregard and remove yourself. Note: there was no use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Node just had to skip 2 callback retry tests that I think is a bug in the driver.
Java driver is also failing the two callback-retry tests:
So maybe not a bug in the Node driver, @durran. There are four other tests failing as well, but those seem to be test runner issues. Patch build: https://spruce.mongodb.com/version/65cb8c513e8e86ed75242d71/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed locations of my comments.
documents: | ||
- { _id: 1 } | ||
- | ||
description: callback is not retried after non-transient error (DuplicateKeyError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legacy test has useMultipleMongoses: true for this one, and the Java driver's legacy test runner skip it when running against a replica set. I'm not sure whether that is correct behavior, but when running this test as a UTR it fails against a 7.0 replica set (it actually times out on Evergreen after about 40 minutes). Can we determine whether it's intended to be run against a replica set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also times out against a sharded cluster, whereas the legacy test passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this test does use useMultipleMongoses: true
. The test file creates two sets of entities and this uses the second set, where the client is configured as such.
AFAICT, there is nothing in the legacy transactions spec test format that would prohibit running a test with useMultipleMongoses: true
on replica sets. The option only applies to sharded and load-balanced topologies.
The first test in this file prohibits multiple mongoses since it configures a fail point (I suppose it could have used a targetedFailPoint
, but none of the withTransaction
tests seem to do so). I don't see any reason it would be incompatible with a replica set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this test is passing for others than it could be an issue with the Java driver UTR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a Node perspective we are getting errors like this:
[2024/02/14 08:47:33.160] 2 MongoServerError: Transaction d461be83-5a2e-472b-aeea-f539187d6c03 - Il1bctjvfYRA0JpLUY4Xle2E6bw+KM1YpMr4x52qGgE= - - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database transaction-tests has undergone a catalog change operation at time Timestamp(1707896852, 152) and no longer satisfies the requirements for the current transaction which requires Timestamp(1707896850, 8). Transaction will be aborted.
[2024/02/14 08:47:33.160] at Connection.sendCommand (/data/mci/87c0cefe0f5d76422caa781a4417a420/src/src/cmap/connection.ts:139:17)
[2024/02/14 08:47:33.160] at processTicksAndRejections (node:internal/process/task_queues:95:5)
[2024/02/14 08:47:33.160] at async Connection.command (/data/mci/87c0cefe0f5d76422caa781a4417a420/src/src/cmap/connection.ts:139:17) {
[2024/02/14 08:47:33.160] ok: 0,
[2024/02/14 08:47:33.160] code: 272,
[2024/02/14 08:47:33.160] codeName: 'MigrationConflict',
[2024/02/14 08:47:33.160] '$clusterTime': {
[2024/02/14 08:47:33.160] clusterTime: new Timestamp({ t: 1707896853, i: 33 }),
[2024/02/14 08:47:33.160] signature: {
[2024/02/14 08:47:33.160] hash: Binary.createFromBase64('GjUJUbMe8Szf9xOinNDFhYbjIJY=', 0),
[2024/02/14 08:47:33.160] keyId: new Long('7335360823633641496')
[2024/02/14 08:47:33.160] }
[2024/02/14 08:47:33.160] },
[2024/02/14 08:47:33.160] operationTime: new Timestamp({ t: 1707896853, i: 33 }),
[2024/02/14 08:47:33.160] [Symbol(errorLabels)]: Set(1) { 'TransientTransactionError' }
[2024/02/14 08:47:33.160] }
which are randomly happening and causing random tests to fail, even in the unified spec runner valid-pass tests. We don't set up any fail points for code 272 so I thought the issue was outside the changes here and something on our side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I'm running into here is https://jira.mongodb.org/browse/DRIVERS-2816
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jyemin: In response to your top-level comment above:
There are four other tests failing as well, but those seem to be test runner issues.
Patch build: https://spruce.mongodb.com/version/65cb8c513e8e86ed75242d71/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
I see that two tasks timed out, but don't see any failures within them. Are you referring to something in a different patch build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test that runs for 40 minutes on Evergreen before the Evergreen task times out fails when run locally after about a minute. So I was able to see these four other test failures locally but not yet on Evergreen.
|
||
tests: | ||
- | ||
description: callback succeeds after multiple connection errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails with the Java driver UTR against a 7.0 replica set:
[2024/02/13 11:23:17.424] org.junit.ComparisonFailure: Command names must be equal
[2024/02/13 11:23:17.424] Assertion Context:
[2024/02/13 11:23:17.424] Event Matching Context
[2024/02/13 11:23:17.424] event position: 1
[2024/02/13 11:23:17.424] expected event: {"commandStartedEvent": {"command": {"abortTransaction": 1, "lsid": {"$$sessionLsid": "session0"}, "txnNumber": 1, "autocommit": false, "readConcern": {"$$exists": false}, "startTransaction": {"$$exists": false}, "writeConcern": {"$$exists": false}}, "commandName": "abortTransaction", "databaseName": "admin"}}
[2024/02/13 11:23:17.424] actual event: {"commandStartedEvent": {"command": {"commitTransaction": 1, "$db": "admin", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1707841397, "i": 7}}, "signature": {"hash": {"$binary": {"base64": "M02Xhbame6Ux6mCG3nsbWTV52i0=", "subType": "00"}}, "keyId": 7335118067787104262}}, "lsid": {"id": {"$binary": {"base64": "sFYW5mm/QLq+0OFJa9yqFA==", "subType": "04"}}}, "txnNumber": 1, "autocommit": false}, "databaseName": "admin"}}
The legacy test passes. I haven't done an RCA, but lmk if you need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jyemin: I think this may be related to the Java UTR's handling of ignoreResultAndError: true
and deciding to not propagate exceptions from the callback's operations to withTransaction
. ignoreResultAndError
should only apply to the UTR's assertion logic and should not get in the way of how withTransaction()
operates (i.e. sequence of actions in the spec).
Quoting the comment in the spec test:
We do not assert the result here, as insertOne will fail for the first two executions of the callback before ultimately succeeding and returning a result. Asserting the state of the output collection after the test is sufficient.
If the insertOne
exceptions are ignored entirely (not just by the UTF but also by withTransaction()
), then I would expect withTransaction
to commit instead of aborting and retrying the callback -- which is what you're seeing above.
See my other response in #1505 (comment).
arguments: | ||
session: *session0 | ||
document: { _id: 1 } | ||
ignoreResultAndError: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the root of the problem in the Java driver UTR. The spec for ignoreResultAndError says:
If true, both the error and result for the operation MUST be ignored.
The Java driver UTR does that, and so when used inside a callback this operation just returns, the callback returns normally rather than exceptionally, and withTransaction, none the wiser, commits the transaction (which succeeds).
I'm not sure how to say in UTF that an error should be ignored for purposes of asserting the operation result, but, in the context of a callback, should also be propagated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The withTransaction
spec doesn't discuss callback return values (The callback function has a flexible signature is probably the only commentary on the subject). I know some drivers do relay the callback's return value through withTransaction()
; however, I wouldn't expect this to come into play with the unified test format.
In the UTF, the callback is expressed as a list of operations. I assume the Java implementation attempts to relay the return value of the last operation, but I'd argue there's no basis for doing so in the UTF spec. And there certainly isn't a spec test on the matter.
What is important (and tested) is being able to relay exceptions from the callback. And that is precisely why I resorted to using ignoreResultAndError
here, as it was the only way to preserve the original behavior in the legacy test format where the absence of a result and error expectation allowed the test runner to forgo any assertion.
See: withTransaction
UTF operation in PHPLIB and the corresponding Operation::assert()
method for reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The withTransaction spec doesn't discuss callback return values (The callback function has a flexible signature is probably the only commentary on the subject). I know some drivers do relay the callback's return value through withTransaction(); however, I wouldn't expect this to come into play with the unified test format.
Agreed that the return value is irrelevant here.
In the UTF, the callback is expressed as a list of operations. I assume the Java implementation attempts to relay the return value of the last operation, but I'd argue there's no basis for doing so in the UTF spec. And there certainly isn't a spec test on the matter.
It does not attempt to relay the return value.
What is important (and tested) is being able to relay exceptions from the callback. And that is precisely why I resorted to using ignoreResultAndError here, as it was the only way to preserve the original behavior in the legacy test format where the absence of a result and error expectation allowed the test runner to forgo any assertion.
This is what's missing in the Java UTR. It treats any exception thrown by the operation the same as it treats the return value. It asserts on one or the other, depending on the test, and "swallows" both the return value and the exception. (It actually will throw an exception, but only a Junit AssertionException because that's how test failures are communicated in Java).
I think I can change the withTransaction operation to work a bit differently so that the exception is propagated, and get this test to pass.
I also realized that the test that is timing out (callback is not retried after non-transient error (DuplicateKeyError)
) is related to this as well. But unlike this test, the insertOne operation that is expected to fail in that test does not include ignoreResultAndError: true
. I'm not actually sure whether ignoreResultAndError
should change the behavior of error propagation. If in the Java driver UTR I just always propagate errors when inside a transaction callback, regardless of the value of ignoreResultAndError
, both tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If in the Java driver UTR I just always propagate errors when inside a transaction callback, regardless of the value of ignoreResultAndError, both tests pass.
That sounds correct to me, and it's consistent with what I was already doing in PHPLIB. Regardless of ignoreResultAndError
, I don't think there's any reason for the UTR's assertions to interfere with the propagation of exceptions from the callback to withTransaction()
.
It's reasonable that this would come up now, as the original tests in poc-transactions-convenient-api.yml never dealt with errors.
I made sure to note this in the downstream changes for DRIVERS-1709:
Ensure that neither ignoreResultAndError nor expectError interfere with propagating exceptions from a callback operation to withTransaction.
8dd8274
to
d28712f
Compare
@jyemin: While diagnosing some PHPLIB build failures, I realized that several tests probably had to be skipped on serverless. The legacy |
I combined both sets of changes into 1 on the Node side, and the only failures we have are related to DRIVERS-2816. See patch: https://spruce.mongodb.com/version/65cc95eba4cf475fb829553a/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC and PR: mongodb/node-mongodb-native#3987 The other failures coming from these changes I verified and ticketed as Node issues and then skipped:
So at least from a Node perspective I think this looks good. |
d28712f
to
f259730
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After changing the Java driver UTR to propagate exceptions from operations within a withTransaction callback, two of the six failing tests now pass. The remaining four are test runner issues and have been skipped. So now there is a clean patch build with the new tests against MongoDB 4.4+ in all deployment types
Unfortunately, there are now new test failures on sharded deployments from 3.6 through 4.2. The failing tests are not the new tests and not even all of them are unified tests. I am not sure what is causing those failures, but you can see them here: mongodb/mongo-java-driver#1310.
Regardless, I'm going to approve this PR as I don't think these test failures are due to the new tests but rather changes I made in the Java UTR.
@jyemin: Just followed up in: mongodb/mongo-java-driver#1310 (comment) For the failures outside of the transactions and |
a386623
to
fd09568
Compare
https://jira.mongodb.org/browse/DRIVERS-1709
Note: this is being split off from #1502
Please complete the following before merging:
clusters, and serverless).
POC: mongodb/mongo-php-library#1226