fix sqs multi-protocol handling and upgrade botocore #9710

alexrashed · 2023-11-22T17:05:02Z

Motivation

This PR cleans up after the quick fixes made for botocore's SQS protocol switch from query to json (and now back again).

These protocol changes are highly problematic, because clients suddenly change the way they talk to AWS (and, in turn, to LocalStack). There's nothing we can do on our side besides implementing the new protocol (in addition to the old protocol, to remain backwards-compatible with other or older clients).

Here's a timeline of what happened:

2023-05-04 - botocore switched the protocol for SQS from query to json in version 1.29.127 (specifically this commit: boto/botocore@2395012).
- The changes were quickly hitting LocalStack users: bug: Creating SQS queue fails with botocore==1.29.127 & botocore==1.31.81 #8267
2023-05-05 - botocore reverts the change with Revert SQS protocol changes boto/botocore#2931 in version 1.29.128.
2023-11-09 - botocore again performs the switch in version 1.31.81 (specifically this commit: boto/botocore@84c7847).
2023-11-11 - Two days after the switch, we were ready with a somewhat hacky solution: fix SQS JSON protocol support in ASF #8268
- fix SQS JSON leftovers #9607 - Performs a quick cleanup after the initial merge of the new protocol support.
- fix SQS json requests sent to query route #9634 - Fixes an issue in the routing
- update ASF APIs, fix SQS serialization #9611 - Adjusts the ASF specs and the provider implementation accordingly
- remove sqs-query from health endpoint #9709 - Removes the artificial service from the list of services in the health endpoint
2023-11-11 - botocore changes the exception handling for SQS in version 1.31.85 in Fix Sqs ErrorType not being parsed properly boto/botocore#3054.
- These changes have been addressed with refresh sqs snapshots and fix exceptions in serializer #9627.
2023-11-15 - botocore switches to gzipped specs which is not compatible with moto at the time.
- 815d6c7 pins botocore as a first response.
2023-11-22 - Bump moto-ext to 4.2.9.post2 #9624 upgrades moto such that it contains Techdebt: Fix compatibility with botocore 1.32.1 getmoto/moto#7030.
2023-11-22 - unpin botocore, update minimum version #9667 unpins botocore again
2023-11-22 - botocore switches the protocol for SQS back from json to query in version 1.32.6 (specifically this PR: Revert SQS protocol changes boto/botocore#3071)
- avoid botocore 1.32.6 #9714 pins botocore as a first response.

This is where we are at. With this PR we want to:

Get rid of the artificial "sqs-query" service. The term sqs-query should only be used in very specific sections explicitly handling the query spec.
Remove the pin on botocore, effectively switching back to query as the default protocol, but still support json.
Remove the pin of AWS CLI in the Dockerfile introduced with pin global awscli install in Dockerfile #9767.

Changes

The first commit cleans up after the previous efforts to adjust to the chaotic protocol changes in SQS.
- This change centers around differentiating the requests on protocol level, rather than on service level.
- It removes the sqs-query service "alias", and removes all special cases handling this service alias.
- It adjusts the serializer and parser such that they differentiate based on the protocol of the given service model (instead of only based on the service).
- It adjusts the service_router such that it delivers a service model instead of only the service name (because that would be ambiguous, there are two service models for SQS).
The second commit upgrades the ASF API stubs based on the latest botocore version (since they had to be held back, see Update ASF APIs #9735).
The third commit actually performs the upgrade / unpinning of botocore.
- This means another switch from json back to query as default protocol for SQS.
- The internalized spec for SQS is switched from query to json (the "non-default" protocol).
- Changes the serializer adjustments which are necessary due to the duality of the protocols, their divergence, and the fact that we only generate a single ASF stub based on one of the specs.
- Adjusts the SQS provider to some recent changes (empty lists / dicts are not contained anymore in certain cases, like messages or tags).
- Fixes / updates quite a lot of tests and snapshots.
  - Also parameterizes most of the SQS tests such that they are snapshot tested against both protocols.
The fourth commit contains some test fixes related to the new changes in the S3 specs / generated code. /cc @bentsku
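
The switch from service-level to protocol-level differentiation described for the first commit can be pictured roughly as follows. This is a minimal sketch, not the actual ASF code: `FakeServiceModel`, `PARSERS`, and `pick_parser` are hypothetical names, and the only assumption is that a loaded service model exposes its service name and its protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeServiceModel:
    """Hypothetical stand-in for a loaded service model (not the botocore class)."""

    service_name: str
    protocol: str


# Hypothetical registry: one parser label per protocol, not per service name.
PARSERS = {
    "query": "query-parser",
    "json": "json-parser",
}


def pick_parser(model: FakeServiceModel) -> str:
    """Choose the parser from the model's protocol, so no alias like 'sqs-query' is needed."""
    try:
        return PARSERS[model.protocol]
    except KeyError:
        raise ValueError(f"unsupported protocol: {model.protocol}") from None


# Two models for the same service, differing only in protocol.
sqs_query = FakeServiceModel("sqs", "query")
sqs_json = FakeServiceModel("sqs", "json")
```

Both models carry the plain service name `sqs`; only the protocol decides which parser is picked.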

github-actions · 2023-12-05T13:36:16Z

S3 Image Test Results (AMD64 / ARM64)

2 files ±0 2 suites ±0 3m 8s ⏱️ -4s
386 tests ±0 336 ✅ ±0 50 💤 ±0 0 ❌ ±0
772 runs ±0 672 ✅ ±0 100 💤 ±0 0 ❌ ±0

Results for commit 72d8615. ± Comparison against base commit 124a4e2.

♻️ This comment has been updated with latest results.

coveralls · 2023-12-05T16:52:08Z

coverage: 83.988% (+0.06%) from 83.926%
when pulling 72d8615 on remove-sqs-query-service
into 124a4e2 on master.

github-actions · 2023-12-07T12:40:47Z

LocalStack Community integration with Pro

2 files 2 suites 1h 17m 41s ⏱️
2 572 tests 2 340 ✅ 232 💤 0 ❌
2 573 runs 2 340 ✅ 233 💤 0 ❌

Results for commit 72d8615.

♻️ This comment has been updated with latest results.

baermat

Wow, this is quite the undertaking. Many thanks for tackling this! I will only comment on the sqs provider side of things, where I saw one thing I wanted to clarify (which happens with multiple tests, so my comments all address the same potential issue)

baermat · 2023-12-07T16:36:28Z

tests/aws/services/sqs/test_sqs.py

-        assert "Messages" in result
-        assert result["Messages"] == []
+        assert "Messages" not in result or result["Messages"] == []


I realize this isn't the most pressing issue here, just as a remark: as the test is aws_validated, it seems odd that "Messages" either doesn't exist or is an empty list. So I guess this differs between the two protocols? In that case, is there a reasonably easy way to check which protocol is currently being tested? Because right now both values are valid for both protocols. Or do/did we accept this as a limitation?

joe4dev · 2023-12-07

For empty lists, AWS often omits the field entirely, independent of the protocol. I noticed this behavior recently while parity testing different SDK versions (which use different protocols) in the Lambda multiruntime tests.

I moved my Notion testing page to the SQS service to demonstrate the behavior for the listQueues operation (see here). It looks like we have an SQS parity issue when listing empty queues (unrelated to this PR).

alexrashed · 2023-12-07

Thanks for the comment! Since this is tested in dozens of the snapshot tests (specifically for each individual protocol due to the new fixture being used in the tests), I wanted to make these assertions independent of the protocol. Also because the next switch is around the corner (I'm quite sure the botocore team will try another run on the switch somewhat soon).
Also, it seems like they are currently actually changing the behavior server-side (the json service is not returning empty lists anymore). But again, I think this is something which should belong in the snapshot verification. Here we actually semantically want to check if there are no messages left (and do not explicitly want to test the protocol).

I'll close the other comments concerning this, but I'm happy to discuss this. If we want to change this, then obviously I'll apply it everywhere.
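
The protocol-independent check discussed in this thread could be wrapped in a small helper, so each test states the semantic intent once. This is only a sketch: `has_no_messages` is a hypothetical name, not a helper from the test suite, and it assumes the response is a plain dict as returned by a ReceiveMessage call.

```python
def has_no_messages(result: dict) -> bool:
    """Return True if the response reports no messages.

    Treats a missing "Messages" key and an empty list as equivalent,
    mirroring the assertion used in the diff above.
    """
    return result.get("Messages", []) == []


# The three response shapes a test might see.
omitted = {"ResponseMetadata": {}}                 # field left out entirely
empty = {"Messages": []}                           # field present but empty
non_empty = {"Messages": [{"Body": "hello"}]}      # one message left
```

A test would then write `assert has_no_messages(result)` and leave the exact response shape to the snapshot verification.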

tests/aws/services/sqs/test_sqs.py

bentsku

Wow, this is an extensive set of changes. The biggest takeaway is the clean separation between the protocol and the service name, made possible with the nice changes in the service router!
We've also had a sync looking at a lot of test changes, and it looks great! I like the parametrization making sure we properly test both protocols.

There's a lot of clean up in here, and it seems #9828 has been merged, which would maybe need additional snapshots and parametrization as well.

All in all, LGTM! Thank you so much for addressing this and making a nice clean fix 🚀

localstack/aws/spec.py

tests/aws/services/s3/test_s3.py

thrau

Overall this is already a huge improvement over what we had before! I do have a few open questions/comments that I'd like to understand before approving, but overall really great work!

tests/unit/aws/protocol/test_parser.py

thrau · 2023-12-11T10:11:15Z

tests/unit/aws/protocol/test_serializer.py

 def test_query_protocol_error_serialization():
    exception = InvalidMessageContents("Exception message!")
    _botocore_error_serializer_integration_test(
-        "sqs-query", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!"
+        "sqs", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!"
    )


since this is specifically a test for the query protocol, should we maybe use a service that we know is going to use an unaltered version of the query serializer?

SQS is used a lot across the parser tests. In fact, the parser and serializer tests should be refactored and structured in the future, but I would like to avoid adding this to this PR, if it's okay for you.

localstack/aws/protocol/service_router.py

thrau · 2023-12-11T10:17:53Z

localstack/aws/protocol/service_router.py

@@ -275,28 +276,32 @@ def get_service_catalog() -> ServiceCatalog:
        return ServiceCatalog()


-def resolve_conflicts(candidates: Set[str], request: Request):
+def resolve_conflicts(candidates: Set[ServiceModel], request: Request):


are we still passing a Set[ServiceModel] ? is ServiceModel even hashable?

It was hashable, yes, but it's a good point. The service router has now been restructured such that the different stages are working based on service model identifiers, and the main function loads and returns the model.
So in the latest revision, this specific parameter has been changed to a set of ServiceIdentifier (a named tuple containing the service name and the protocol).
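
To illustrate why a named tuple works as a set element here: it hashes and compares by value, like a plain tuple. The sketch below uses a hypothetical `ServiceId` type whose two fields (service name and protocol) are taken from the description above; the real class in LocalStack may look different.

```python
from typing import NamedTuple, Set


class ServiceId(NamedTuple):
    """Hypothetical identifier: service name plus protocol."""

    name: str
    protocol: str


# Named tuples are hashable as long as their fields are hashable,
# so equal identifiers collapse into a single set element.
candidates: Set[ServiceId] = {
    ServiceId("sqs", "query"),
    ServiceId("sqs", "json"),
    ServiceId("sqs", "query"),  # duplicate of the first entry
}
```

Because equality is by value, a candidate set built this way never needs the (much heavier) loaded service model just to deduplicate or compare entries.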

thrau · 2023-12-11T10:19:52Z

localstack/aws/protocol/service_router.py

        content_type = request.headers.get("Content-Type")
-        return "sqs" if content_type == "application/x-amz-json-1.0" else "sqs-query"
+        return "sqs-json" if content_type == "application/x-amz-json-1.0" else "sqs"


nit: i was under the impression we were getting rid of the virtual service names? shouldn't this then return the service model instead? if not, should we introduce some form of value type for the (service,version,protocol) tuple which should uniquely identify a service model?

Thanks for your input, this is how it's working now (we have a named tuple uniquely identifying a loadable service model), which is the return type of the conflict resolution / heuristic stage functions.
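
A conflict-resolution step that returns such an identifier instead of a virtual service name might look like the sketch below. The Content-Type check is taken from the diff above; `ServiceId` and `resolve_sqs` are hypothetical names, and the headers are modelled as a plain mapping rather than a real request object.

```python
from typing import Mapping, NamedTuple


class ServiceId(NamedTuple):
    """Hypothetical identifier: service name plus protocol."""

    name: str
    protocol: str


def resolve_sqs(headers: Mapping[str, str]) -> ServiceId:
    """Pick the SQS protocol from the Content-Type header.

    JSON requests use 'application/x-amz-json-1.0'; everything else
    is treated as the query protocol, as in the original heuristic.
    """
    content_type = headers.get("Content-Type")
    protocol = "json" if content_type == "application/x-amz-json-1.0" else "query"
    return ServiceId("sqs", protocol)
```

The service name stays `sqs` in both cases, which is the point of the review comment: the protocol travels alongside the name instead of being encoded in it.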

localstack/aws/protocol/service_router.py

thrau · 2023-12-11T10:27:36Z

localstack/aws/spec.py

    @cached_property
-    def target_prefix_index(self) -> Dict[str, List[ServiceName]]:
+    def target_prefix_index(self) -> Dict[str, List[ServiceModel]]:


We seem to now store ServiceModel instances in the catalog index, which is the one we serialize and save to disk. How does this affect the size of the cache file and the performance of serializing/deserializing the cache?

I would be more inclined to keep cache structures to primitives types when they are stored. So something like the service,version,protocol tuple i mentioned earlier could be useful for that?

This is a very good point, the model caching totally slipped my mind.
I changed the index such that it now only uses ServiceModelIdentifier (a NamedTuple).
This obviously increases the size of the service catalog pickle, but just by ~225 kB instead of multiple megabytes: 383.9 kB (old) compared to 608.6 kB (new).
This results in the following building and loading times of the service catalog:

| Stage | Previous Index | New Index |
|---|---|---|
| Building | 2.0 - 2.1s | 1.8 - 2.2s |
| Loading | 14 - 15ms | 27 - 40ms |

However, the variance was quite high for my tests (on my notebook).
But we can see that the building doesn't really take considerably longer. The loading takes double the time, but it is only executed once per LocalStack run, and the impact is at most 25ms.
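
A comparison like the one above can be reproduced in miniature with the standard `pickle` module. The sketch below uses synthetic data and plain `(name, protocol)` tuples as a stand-in for the identifier type; the index contents, sizes, and function names are all assumptions for illustration and say nothing about the real catalog numbers.

```python
import pickle
import time
from typing import Dict, List, Tuple


def build_name_index(n: int) -> Dict[str, List[str]]:
    """Synthetic index mapping a target prefix to service names only."""
    return {f"Prefix{i}": [f"service-{i}"] for i in range(n)}


def build_identifier_index(n: int) -> Dict[str, List[Tuple[str, str]]]:
    """Synthetic index mapping a target prefix to (name, protocol) tuples."""
    return {f"Prefix{i}": [(f"service-{i}", "query")] for i in range(n)}


def measure(index) -> Tuple[int, float]:
    """Return the pickle size in bytes and the time needed to load it again."""
    blob = pickle.dumps(index)
    start = time.perf_counter()
    loaded = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    if loaded != index:
        raise RuntimeError("pickle round-trip changed the index")
    return len(blob), elapsed


name_size, name_time = measure(build_name_index(300))
id_size, id_time = measure(build_identifier_index(300))
print(f"names: {name_size} bytes, loaded in {name_time * 1000:.3f} ms")
print(f"ids:   {id_size} bytes, loaded in {id_time * 1000:.3f} ms")
```

Timings from a single run like this are noisy, which matches the variance noted above; repeating the measurement several times gives a more reliable picture.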

localstack/aws/handlers/service.py

localstack/aws/protocol/parser.py

thrau · 2023-12-11T18:32:25Z

localstack/aws/spec.py

+    # the service name needs to be set to standard sqs
+    if service == "sqs-json":
+        service = "sqs"
    return ServiceModel(service_description, service)


It would be really good if we could get protocol: str as an optional parameter into our load_service implementation instead of relying on the fake service names. i understand that we currently need them under the hood to know which spec file to load, but basically our api should understand protocols as first-class citizen.

This is a good point. The new solution does not use the sqs-json service anymore in the code.
However, using a custom service name has a big benefit: it integrates really nicely with botocore and our internal AWS client factory. We can just use aws_client.sqs_json to load a client which reads the spec we ship in the localstack module. This can be really helpful in the tests, and might also come in handy for other use cases. So we would like to have two ways to load a service with a specific protocol:

Add the suffix to the service name (e.g. sqs-json), which enables the use case mentioned above.

Add the protocol to the service loading function, to have a clean and clear way to define a specific protocol of a service to load.

I also tried to avoid patching botocore. Since spec#load_service is the only function using the custom loader, the custom logic (implementing the protocol-specific loading) is contained in that function.
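
The two loading styles listed above (name suffix and explicit protocol parameter) could be normalized to one internal form along these lines. This is a hedged sketch, not the code in `spec.py`: `MULTI_PROTOCOL_SERVICES` and `normalize_service` are hypothetical, and the table of services with more than one spec is an assumption containing only SQS.

```python
from typing import Optional, Tuple

# Hypothetical table of services that ship more than one protocol spec.
MULTI_PROTOCOL_SERVICES = {"sqs": ("query", "json")}


def normalize_service(service: str, protocol: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Turn 'sqs-json' or ('sqs', 'json') into the pair ('sqs', 'json').

    The suffix is only interpreted for services listed in the table, so
    real service names that happen to end in a protocol name (for example
    'timestream-query') are passed through unchanged.
    """
    base, sep, suffix = service.rpartition("-")
    if sep and suffix in MULTI_PROTOCOL_SERVICES.get(base, ()):
        if protocol is not None and protocol != suffix:
            raise ValueError(f"conflicting protocols: {suffix!r} vs {protocol!r}")
        return base, suffix
    # No recognized suffix: keep the name, use the explicit protocol (may be None).
    return service, protocol
```

With a helper like this, `normalize_service("sqs-json")` and `normalize_service("sqs", "json")` yield the same pair, so the suffix form stays available for client factories while the rest of the code only deals with name and protocol.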

thrau

fantastic job @alexrashed! thanks a lot for your patience and persistence on this long standing issue! i think we found a good solution for now, and it's exciting to see how we're really pushing the boundaries of botocore and what we need from it.

well done!

thrau · 2024-01-05T20:06:28Z

merging the PR to get some data over the weekend. will keep an eye on it!

alexrashed added the semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases label Nov 22, 2023

alexrashed added this to the 3.1 milestone Nov 22, 2023

alexrashed self-assigned this Nov 22, 2023

This was referenced Nov 22, 2023

remove sqs-query from health endpoint #9709

Merged

avoid botocore 1.32.6 #9714

Merged

Update ASF APIs #9735

Closed

localstack deleted a comment from localstack-bot Nov 29, 2023

alexrashed mentioned this pull request Nov 29, 2023

pin global awscli install in Dockerfile #9767

Merged

alexrashed force-pushed the remove-sqs-query-service branch from f01ddec to cb8f2cf Compare December 5, 2023 13:30

alexrashed force-pushed the remove-sqs-query-service branch 2 times, most recently from 0ebd655 to ab02b47 Compare December 5, 2023 16:09

alexrashed force-pushed the remove-sqs-query-service branch 3 times, most recently from c092a3c to 3b792e5 Compare December 7, 2023 09:22

alexrashed changed the title ~~remove sqs-query service alias~~ remove sqs-query service alias, upgrade botocore Dec 7, 2023

alexrashed force-pushed the remove-sqs-query-service branch 2 times, most recently from 7619561 to 09a4c3f Compare December 7, 2023 11:57

alexrashed marked this pull request as ready for review December 7, 2023 14:03

alexrashed requested review from sannya-singal, ackdav, dominikschubert, MEPalma, thrau, baermat, bentsku, macnev2013 and silv-io as code owners December 7, 2023 14:03

baermat reviewed Dec 7, 2023

joe4dev mentioned this pull request Dec 8, 2023

Fix AWS parity for SQS list queues with empty response #9828

Merged

bentsku approved these changes Dec 11, 2023

thrau reviewed Dec 11, 2023

thrau mentioned this pull request Dec 28, 2023

bug: Wrong message body format on SQS message received through ReceiveMessage action. #8451

Open

1 task

alexrashed force-pushed the remove-sqs-query-service branch from 2d78884 to 6fa2171 Compare January 4, 2024 10:31

alexrashed requested review from giograno and pinzon as code owners January 4, 2024 10:31

alexrashed force-pushed the remove-sqs-query-service branch 3 times, most recently from f9b30e2 to 005a20a Compare January 4, 2024 18:08

This was referenced Jan 4, 2024

fix sqs SendMessageBatch result for SDK compatibility #9998

Closed

bug: Localstack doesn't work properly with SQS client from AWS SDK v2 version 2.21.18 and above #9832

Closed

alexrashed force-pushed the remove-sqs-query-service branch 2 times, most recently from 01734d5 to b14b789 Compare January 5, 2024 08:25

alexrashed added 2 commits January 5, 2024 12:02

remove sqs-query service alias, differentiate based on protocol

e353aab

update ASF specs with botocore==1.33.8

7428adc

alexrashed force-pushed the remove-sqs-query-service branch from b14b789 to e151e88 Compare January 5, 2024 11:16

alexrashed requested a review from thrau January 5, 2024 11:17

alexrashed force-pushed the remove-sqs-query-service branch from e151e88 to 5afd969 Compare January 5, 2024 11:29

alexrashed added 3 commits January 5, 2024 13:52

upgrade / unpin botocore, switch sqs protocol, fix tests

d894cda

fix tests and signatures after ASF stub updates

2c8e246

refactor service loading based on protocol

72d8615

alexrashed force-pushed the remove-sqs-query-service branch from 5afd969 to 72d8615 Compare January 5, 2024 12:52

thrau approved these changes Jan 5, 2024

thrau changed the title ~~remove sqs-query service alias, upgrade botocore~~ fix sqs multi-protocol handling and upgrade botocore Jan 5, 2024

thrau merged commit 41b4d23 into master Jan 5, 2024
34 checks passed

thrau deleted the remove-sqs-query-service branch January 5, 2024 20:06

alexrashed mentioned this pull request Apr 30, 2024

update ASF APIs, switch SQS to JSON again #10741

Merged
