Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-1709: Convert withTransaction tests to unified format #1505

Merged
merged 16 commits into from Feb 15, 2024

Conversation

jmikola
Copy link
Member

@jmikola jmikola commented Feb 7, 2024

https://jira.mongodb.org/browse/DRIVERS-1709

Note: this is being split off from #1502

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded
    clusters, and serverless).

POC: mongodb/mongo-php-library#1226

@jmikola jmikola force-pushed the drivers-1709-convenient-api branch 4 times, most recently from 1743348 to d4df26a Compare February 9, 2024 02:11
jmikola added a commit to jmikola/mongo-php-library that referenced this pull request Feb 9, 2024
@jmikola jmikola marked this pull request as ready for review February 9, 2024 02:15
@jmikola jmikola requested review from a team as code owners February 9, 2024 02:15
@jmikola jmikola requested review from alcaeus and removed request for a team February 9, 2024 02:15
jmikola added a commit to jmikola/mongo-php-library that referenced this pull request Feb 9, 2024
@durran durran self-assigned this Feb 12, 2024
@jmikola jmikola requested a review from jyemin February 13, 2024 01:33
@jmikola
Copy link
Member Author

jmikola commented Feb 13, 2024

@jyemin: Just tagged you in case you'd like to add these to a Java patch build. Feel free to disregard and remove yourself.

Note: there was no use of targetedFailPoint in the withTransaction tests, so no further modifications were made to skip load-balanced topologies.

Copy link
Member

@durran durran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Node just had to skip 2 callback retry tests that I think is a bug in the driver.

@jyemin
Copy link
Contributor

jyemin commented Feb 13, 2024

Java driver is also failing the two callback-retry tests:

  1. callback is not retried after non-transient error (DuplicateKeyError)
  2. callback succeeds after multiple connection errors

So maybe not a bug in the Node driver, @durran.

There are four other tests failing as well, but those seem to be test runner issues.

Patch build: https://spruce.mongodb.com/version/65cb8c513e8e86ed75242d71/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed locations of my comments.

documents:
- { _id: 1 }
-
description: callback is not retried after non-transient error (DuplicateKeyError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy test has useMultipleMongoses: true for this one, and the Java driver's legacy test runner skip it when running against a replica set. I'm not sure whether that is correct behavior, but when running this test as a UTR it fails against a 7.0 replica set (it actually times out on Evergreen after about 40 minutes). Can we determine whether it's intended to be run against a replica set?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also times out against a sharded cluster, whereas the legacy test passes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this test does use useMultipleMongoses: true. The test file creates two sets of entities and this uses the second set, where the client is configured as such.

AFAICT, there is nothing in the legacy transactions spec test format that would prohibit running a test with useMultipleMongoses: true on replica sets. The option only applies to sharded and load-balanced topologies.

The first test in this file prohibits multiple mongoses since it configures a fail point (I suppose it could have used a targetedFailPoint, but none of the withTransaction tests seem to do so). I don't see any reason it would be incompatible with a replica set.

Copy link
Contributor

@jyemin jyemin Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this test is passing for others than it could be an issue with the Java driver UTR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a Node perspective we are getting errors like this:

[2024/02/14 08:47:33.160] 2 MongoServerError: Transaction d461be83-5a2e-472b-aeea-f539187d6c03 - Il1bctjvfYRA0JpLUY4Xle2E6bw+KM1YpMr4x52qGgE= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database transaction-tests has undergone a catalog change operation at time Timestamp(1707896852, 152) and no longer satisfies the requirements for the current transaction which requires Timestamp(1707896850, 8). Transaction will be aborted.
[2024/02/14 08:47:33.160]     at Connection.sendCommand (/data/mci/87c0cefe0f5d76422caa781a4417a420/src/src/cmap/connection.ts:139:17)
[2024/02/14 08:47:33.160]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[2024/02/14 08:47:33.160]     at async Connection.command (/data/mci/87c0cefe0f5d76422caa781a4417a420/src/src/cmap/connection.ts:139:17) {
[2024/02/14 08:47:33.160]   ok: 0,
[2024/02/14 08:47:33.160]   code: 272,
[2024/02/14 08:47:33.160]   codeName: 'MigrationConflict',
[2024/02/14 08:47:33.160]   '$clusterTime': {
[2024/02/14 08:47:33.160]     clusterTime: new Timestamp({ t: 1707896853, i: 33 }),
[2024/02/14 08:47:33.160]     signature: {
[2024/02/14 08:47:33.160]       hash: Binary.createFromBase64('GjUJUbMe8Szf9xOinNDFhYbjIJY=', 0),
[2024/02/14 08:47:33.160]       keyId: new Long('7335360823633641496')
[2024/02/14 08:47:33.160]     }
[2024/02/14 08:47:33.160]   },
[2024/02/14 08:47:33.160]   operationTime: new Timestamp({ t: 1707896853, i: 33 }),
[2024/02/14 08:47:33.160]   [Symbol(errorLabels)]: Set(1) { 'TransientTransactionError' }
[2024/02/14 08:47:33.160] }

which are randomly happening and causing random tests to fail, even in the unified spec runner valid-pass tests. We don't set up any fail points for code 272 so I thought the issue was outside the changes here and something on our side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I'm running into here is https://jira.mongodb.org/browse/DRIVERS-2816

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jyemin: In response to your top-level comment above:

There are four other tests failing as well, but those seem to be test runner issues.

Patch build: https://spruce.mongodb.com/version/65cb8c513e8e86ed75242d71/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

I see that two tasks timed out, but don't see any failures within them. Are you referring to something in a different patch build?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test that runs for 40 minutes on Evergreen before the Evergreen task times out fails when run locally after about a minute. So I was able to see these four other test failures locally but not yet on Evergreen.


tests:
-
description: callback succeeds after multiple connection errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails with the Java driver UTR against a 7.0 replica set:

[2024/02/13 11:23:17.424]     org.junit.ComparisonFailure: Command names must be equal
[2024/02/13 11:23:17.424]     Assertion Context:
[2024/02/13 11:23:17.424]     Event Matching Context
[2024/02/13 11:23:17.424]        event position: 1
[2024/02/13 11:23:17.424]        expected event: {"commandStartedEvent": {"command": {"abortTransaction": 1, "lsid": {"$$sessionLsid": "session0"}, "txnNumber": 1, "autocommit": false, "readConcern": {"$$exists": false}, "startTransaction": {"$$exists": false}, "writeConcern": {"$$exists": false}}, "commandName": "abortTransaction", "databaseName": "admin"}}
[2024/02/13 11:23:17.424]        actual event:   {"commandStartedEvent": {"command": {"commitTransaction": 1, "$db": "admin", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1707841397, "i": 7}}, "signature": {"hash": {"$binary": {"base64": "M02Xhbame6Ux6mCG3nsbWTV52i0=", "subType": "00"}}, "keyId": 7335118067787104262}}, "lsid": {"id": {"$binary": {"base64": "sFYW5mm/QLq+0OFJa9yqFA==", "subType": "04"}}}, "txnNumber": 1, "autocommit": false}, "databaseName": "admin"}}

The legacy test passes. I haven't done an RCA, but lmk if you need it.

Copy link
Member Author

@jmikola jmikola Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jyemin: I think this may be related to the Java UTR's handling of ignoreResultAndError: true and deciding to not propagate exceptions from the callback's operations to withTransaction. ignoreResultAndError should only apply to the UTR's assertion logic and should not get in the way of how withTransaction() operates (i.e. sequence of actions in the spec).

Quoting the comment in the spec test:

We do not assert the result here, as insertOne will fail for the first two executions of the callback before ultimately succeeding and returning a result. Asserting the state of the output collection after the test is sufficient.

If the insertOne exceptions are ignored entirely (not just by the UTF but also by withTransaction()), then I would expect withTransaction to commit instead of aborting and retrying the callback -- which is what you're seeing above.

See my other response in #1505 (comment).

arguments:
session: *session0
document: { _id: 1 }
ignoreResultAndError: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the root of the problem in the Java driver UTR. The spec for ignoreResultAndError says:

If true, both the error and result for the operation MUST be ignored.

The Java driver UTR does that, and so when used inside a callback this operation just returns, the callback returns normally rather than exceptionally, and withTransaction, none the wiser, commits the transaction (which succeeds).

I'm not sure how to say in UTF that an error should be ignored for purposes of asserting the operation result, but, in the context of a callback, should also be propagated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The withTransaction spec doesn't discuss callback return values (The callback function has a flexible signature is probably the only commentary on the subject). I know some drivers do relay the callback's return value through withTransaction(); however, I wouldn't expect this to come into play with the unified test format.

In the UTF, the callback is expressed as a list of operations. I assume the Java implementation attempts to relay the return value of the last operation, but I'd argue there's no basis for doing so in the UTF spec. And there certainly isn't a spec test on the matter.

What is important (and tested) is being able to relay exceptions from the callback. And that is precisely why I resorted to using ignoreResultAndError here, as it was the only way to preserve the original behavior in the legacy test format where the absence of a result and error expectation allowed the test runner to forgo any assertion.

See: withTransaction UTF operation in PHPLIB and the corresponding Operation::assert() method for reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The withTransaction spec doesn't discuss callback return values (The callback function has a flexible signature is probably the only commentary on the subject). I know some drivers do relay the callback's return value through withTransaction(); however, I wouldn't expect this to come into play with the unified test format.

Agreed that the return value is irrelevant here.

In the UTF, the callback is expressed as a list of operations. I assume the Java implementation attempts to relay the return value of the last operation, but I'd argue there's no basis for doing so in the UTF spec. And there certainly isn't a spec test on the matter.

It does not attempt to relay the return value.

What is important (and tested) is being able to relay exceptions from the callback. And that is precisely why I resorted to using ignoreResultAndError here, as it was the only way to preserve the original behavior in the legacy test format where the absence of a result and error expectation allowed the test runner to forgo any assertion.

This is what's missing in the Java UTR. It treats any exception thrown by the operation the same as it treats the return value. It asserts on one or the other, depending on the test, and "swallows" both the return value and the exception. (It actually will throw an exception, but only a Junit AssertionException because that's how test failures are communicated in Java).

I think I can change the withTransaction operation to work a bit differently so that the exception is propagated, and get this test to pass.

I also realized that the test that is timing out (callback is not retried after non-transient error (DuplicateKeyError)) is related to this as well. But unlike this test, the insertOne operation that is expected to fail in that test does not include ignoreResultAndError: true. I'm not actually sure whether ignoreResultAndError should change the behavior of error propagation. If in the Java driver UTR I just always propagate errors when inside a transaction callback, regardless of the value of ignoreResultAndError, both tests pass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If in the Java driver UTR I just always propagate errors when inside a transaction callback, regardless of the value of ignoreResultAndError, both tests pass.

That sounds correct to me, and it's consistent with what I was already doing in PHPLIB. Regardless of ignoreResultAndError, I don't think there's any reason for the UTR's assertions to interfere with the propagation of exceptions from the callback to withTransaction().

It's reasonable that this would come up now, as the original tests in poc-transactions-convenient-api.yml never dealt with errors.

I made sure to note this in the downstream changes for DRIVERS-1709:

Ensure that neither ignoreResultAndError nor expectError interfere with propagating exceptions from a callback operation to withTransaction.

jmikola added a commit to jmikola/mongo-php-library that referenced this pull request Feb 13, 2024
@jmikola
Copy link
Member Author

jmikola commented Feb 13, 2024

@jyemin: While diagnosing some PHPLIB build failures, I realized that several tests probably had to be skipped on serverless. The legacy withTransaction tests never included load-balanced so they were always excluded on serverless.

@durran
Copy link
Member

durran commented Feb 14, 2024

I combined both sets of changes into 1 on the Node side, and the only failures we have are related to DRIVERS-2816. See patch: https://spruce.mongodb.com/version/65cc95eba4cf475fb829553a/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC and PR: mongodb/node-mongodb-native#3987

The other failures coming from these changes I verified and ticketed as Node issues and then skipped:

So at least from a Node perspective I think this looks good.

@jyemin jyemin self-requested a review February 14, 2024 20:31
Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing the Java driver UTR to propagate exceptions from operations within a withTransaction callback, two of the six failing tests now pass. The remaining four are test runner issues and have been skipped. So now there is a clean patch build with the new tests against MongoDB 4.4+ in all deployment types

Unfortunately, there are now new test failures on sharded deployments from 3.6 through 4.2. The failing tests are not the new tests and not even all of them are unified tests. I am not sure what is causing those failures, but you can see them here: mongodb/mongo-java-driver#1310.

Regardless, I'm going to approve this PR as I don't think these test failures are due to the new tests but rather changes I made in the Java UTR.

@durran durran removed their assignment Feb 15, 2024
@jmikola
Copy link
Member Author

jmikola commented Feb 15, 2024

Unfortunately, there are now new test failures on sharded deployments from 3.6 through 4.2. The failing tests are not the new tests and not even all of them are unified tests. I am not sure what is causing those failures, but you can see them here: mongodb/mongo-java-driver#1310.

@jyemin: Just followed up in: mongodb/mongo-java-driver#1310 (comment)

For the failures outside of the transactions and withTransaction spec tests, it may be helpful to schedule more limited builds without these suites to see if there's something with these tests leaving the cluster in a state where other spec suites are failing down the line.

@jmikola jmikola merged commit e51a7bc into mongodb:master Feb 15, 2024
4 checks passed
@jmikola jmikola deleted the drivers-1709-convenient-api branch February 15, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants