Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix sqs multi-protocol handling and upgrade botocore #9710

Merged
merged 5 commits into from Jan 5, 2024

Conversation

alexrashed
Copy link
Member

@alexrashed alexrashed commented Nov 22, 2023

Motivation

This PR aims at cleaning up after the quick fixes with the protocol switch in SQS of botocore from query to json (and now back again).

These changes of the protocol are highly problematic, because this means that clients are suddenly changing the way how they talk to AWS (and in turn also with LocalStack). There's nothing we can do on our side besides implementing the new protocol (in addition to the old protocol to remain downwards-compatible with other or older clients).

Here's a timeline of what happened:

This is where we are at. With this PR we want to:

  • Get rid of the artificial "sqs-query" service. The term sqs-query should only be used in very specific sections explicitly handling the query spec.
  • Remove the pin on botocore, effectively switching back to query as the default protocol, but still support json.
  • Remove the pin of AWS CLI in the Dockerfile introduced with pin global awscli install in Dockerfile #9767.

Changes

  • The first commit cleans up after the previous efforts to adjust to the chaotic changes around the protocol changes in SQS.
    • This change centers around switching from differentiating the requests on protocol level, rather than on service level.
    • It removes the sqs-query service "alias", and removes all special cases handling this service alias.
    • It adjusts the serializer and parser such that they differentiate based on the protocol of the given service model (instead of only based on the service).
    • It adjusts the service_router such that it delivers a service model instead of only the service name (because that would be ambiguous, there are two service models for SQS).
  • The second commit upgrades the ASF API stubs based on the latest botocore version (since they had to be hold back, see Update ASF APIs #9735).
  • The third commit actually performs the upgrade / unpinning of botocore.
    • This means another switch from json back to query as default protocol for SQS.
    • The internalized spec for SQS is switched from query to json (the "non-default" protocol).
    • Changes the serializer adjustments which are necessary due to the duality of the protocols, their divergence, and the fact that we only generate a single ASF stub based on one of the specs.
    • Adjusts the SQS provider to some recent changes (empty lists / dicts are not contained anymore in certain cases, like messages or tags).
    • Fixes / updates quite a lot of tests and snapshots.
      • Also parameterizes most of the SQS tests such that they are snapshot tested against both protocols.
  • The fourth commit contains some test fixes related to the new changes in the S3 specs / generated code. /cc @bentsku

@alexrashed alexrashed added the semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases label Nov 22, 2023
@alexrashed alexrashed added this to the 3.1 milestone Nov 22, 2023
@alexrashed alexrashed self-assigned this Nov 22, 2023
@localstack localstack deleted a comment from localstack-bot Nov 29, 2023
Copy link

github-actions bot commented Dec 5, 2023

S3 Image Test Results (AMD64 / ARM64)

  2 files  ±0    2 suites  ±0   3m 8s ⏱️ -4s
386 tests ±0  336 ✅ ±0   50 💤 ±0  0 ❌ ±0 
772 runs  ±0  672 ✅ ±0  100 💤 ±0  0 ❌ ±0 

Results for commit 72d8615. ± Comparison against base commit 124a4e2.

♻️ This comment has been updated with latest results.

@alexrashed alexrashed force-pushed the remove-sqs-query-service branch 2 times, most recently from 0ebd655 to ab02b47 Compare December 5, 2023 16:09
@coveralls
Copy link

coveralls commented Dec 5, 2023

Coverage Status

coverage: 83.988% (+0.06%) from 83.926%
when pulling 72d8615 on remove-sqs-query-service
into 124a4e2 on master.

@alexrashed alexrashed force-pushed the remove-sqs-query-service branch 3 times, most recently from c092a3c to 3b792e5 Compare December 7, 2023 09:22
@alexrashed alexrashed changed the title remove sqs-query service alias remove sqs-query service alias, upgrade botocore Dec 7, 2023
@alexrashed alexrashed force-pushed the remove-sqs-query-service branch 2 times, most recently from 7619561 to 09a4c3f Compare December 7, 2023 11:57
Copy link

github-actions bot commented Dec 7, 2023

LocalStack Community integration with Pro

    2 files      2 suites   1h 17m 41s ⏱️
2 572 tests 2 340 ✅ 232 💤 0 ❌
2 573 runs  2 340 ✅ 233 💤 0 ❌

Results for commit 72d8615.

♻️ This comment has been updated with latest results.

Copy link
Member

@baermat baermat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is quite the undertaking. Many thanks for tackling this! I will only comment on the sqs provider side of things, where I saw one thing I wanted to clarify (which happens with multiple tests, so my comments all address the same potential issue)

Comment on lines -254 to +272
assert "Messages" in result
assert result["Messages"] == []
assert "Messages" not in result or result["Messages"] == []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this isn't the most pressing issue here, just as a remark: as the test is aws_validated, it seems odd that "Messages" either doesn't exist or is an empty list. So I guess this differs between the two protocols? In that case, is there a reasonably easy way to check according to which protocol is tested right now? Because right now both values are valid for both protocols. Or do/did we accept this as limitation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For empty lists, AWS often omits the field entirely, independent of the protocol. I noticed this behavior recently while parity testing different SDK versions (which use different protocols) in the Lambda multiruntime tests.

I moved my Notion testing page to the SQS service to demonstrate the behavior for the listQueues operation (see here). It looks like we have an SQS parity issue when listing empty queues (unrelated to this PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! Since this is tested in dozens of the snapshot tests (specifically for each individual protocol due to the new fixture being used in the tests), I wanted to make these assertions independent of the protocol. Also because the next switch is around the corner (I'm quite sure the botocore team will try another run on the switch somewhat soon).
Also, it seems like they are currently actually changing the behavior server-side (the json service is not returning empty lists anymore). But a gain, I think this is something which should belong in the snapshot verification. Here we actually semantically want to check if there are no messages left (and not explicitly want to test the protocol).

I'll close the other comments concerning this, but I'm happy to discuss this. If we want to change this, then obviously I'll apply it everywhere.

tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
tests/aws/services/sqs/test_sqs.py Show resolved Hide resolved
Copy link
Contributor

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is an extensive set of changes. Biggest takeaway the clean separation between the protocol and the service name, made possible with the nice changes in the service router!
We've also had a sync looking at a lot of test changes, and it looks great! I like the parametrization making sure we properly test both protocols.

There's a lot of clean up in here, and it seems #9828 has been merged, which would maybe need additional snapshots and parametrization as well.

All in all, LGTM! Thank you so much for addressing this and making a nice clean fix 🚀

localstack/aws/spec.py Outdated Show resolved Hide resolved
tests/aws/services/s3/test_s3.py Show resolved Hide resolved
Copy link
Member

@thrau thrau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is already a huge improvement over what we had before! I do have a few open questions/comments that I'd like to understand before approving, but overall really great work!

tests/unit/aws/protocol/test_parser.py Show resolved Hide resolved
Comment on lines 476 to 480
def test_query_protocol_error_serialization():
exception = InvalidMessageContents("Exception message!")
_botocore_error_serializer_integration_test(
"sqs-query", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!"
"sqs", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is specifically a test for the query protocol, should we maybe use a service that we know is going to use an unaltered version of the query serializer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQS is used a lot across the parser tests. In fact, the parser and serializer tests should be refactored and structured in the future, but I would like to avoid adding this to this PR, if it's okay for you.

localstack/aws/protocol/service_router.py Show resolved Hide resolved
@@ -275,28 +276,32 @@ def get_service_catalog() -> ServiceCatalog:
return ServiceCatalog()


def resolve_conflicts(candidates: Set[str], request: Request):
def resolve_conflicts(candidates: Set[ServiceModel], request: Request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we still passing a Set[ServiceModel] ? is ServiceModel even hashable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was hashable, yes, but it's a good point. The service router has now been restructured such that the different stages are working based on service model identifiers, and the main function loads and returns the model.
So in the latest revision, this specific parameter has been changed to a set of ServiceIdentifier (a named tuple containing the service name and the protocol).

content_type = request.headers.get("Content-Type")
return "sqs" if content_type == "application/x-amz-json-1.0" else "sqs-query"
return "sqs-json" if content_type == "application/x-amz-json-1.0" else "sqs"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i was under the impression we were getting rid of the virtual service names? shouldn't this then return the service model instead? if not, should we introduce some form of value type for the (service,version,protocol) tuple which should uniquely identify a service model?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you input, this is how it's working now (we have a named tuple uniquely identifying a loadable service model), which is the return type of the conflict resolution / heuristic stage functions.

localstack/aws/protocol/service_router.py Outdated Show resolved Hide resolved
Comment on lines 114 to 115
@cached_property
def target_prefix_index(self) -> Dict[str, List[ServiceName]]:
def target_prefix_index(self) -> Dict[str, List[ServiceModel]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to now store ServiceModel instances in the catalog index, which is the one we serialize and save to disk. How does this affect the size of the cache file and the performance of serializing/deserializing the cache?

I would be more inclined to keep cache structures to primitives types when they are stored. So something like the service,version,protocol tuple i mentioned earlier could be useful for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good point, the model caching totally slipped my mind.
I changed the index such that it now only uses ServiceModelIdentifier (a NamedTuple).
This obviously increases the size of the service catalog pickle, but just by ~225kb instead of multiple megabytes: 383,9kb (old) compared to 608,6kb (new)
This results in the following building and loading times of the service catalog:

Stage Previous Index New Index
Building 2.0 - 2.1s 1.8 - 2.2s
Loading 14 - 15ms 27 - 40ms

However, the variance was quite high for my tests (on my notebook).
But we can see that the building doesn't really take considerably longer, the loading takes double the time, but is only executed once per LocalStack run and it's an impact of max. 25ms.

localstack/aws/handlers/service.py Show resolved Hide resolved
localstack/aws/protocol/parser.py Show resolved Hide resolved
Comment on lines 74 to 129
# the service name needs to be set to standard sqs
if service == "sqs-json":
service = "sqs"
return ServiceModel(service_description, service)
Copy link
Member

@thrau thrau Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really good if we could get protocol: str as an optional parameter into our load_service implementation instead of relying on the fake service names. i understand that we currently need them under the hood to know which spec file to load, but basically our api should understand protocols as first-class citizen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. The new solution does not use the sqs-json service anymore in the code.
However, using a custom service name has a big benefits: This way it integrates really nicely with botocore and our internal AWS client factory. We can just use aws_client.sqs_json to load a client which reads the spec we ship in the localstack module. This can be really helpful in the tests, and might also come in handy for other use cases. So we would like to have two ways to load a service with a specific protocol:

  • Add the suffix to the service name (f.e. sqs-json), which enables the use case mentioned above.
  • Add the protocol to the service loading function, to have a clean and clear way to define a specific protocol of a service to load.

I also tried to avoid patching botocore, but since the spec#load_service is the only function using the custom loader, which is why this custom logic (implementing the protocol-specific loading) is contained in that function.

Copy link
Member

@thrau thrau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fantastic job @alexrashed! thanks a lot for your patience and persistence on this long standing issue! i think we found a good solution for now, and it's exciting to see how we're really pushing the boundaries of botocore and what we need from it.

well done!

@thrau thrau changed the title remove sqs-query service alias, upgrade botocore fix sqs multi-protocol handling and upgrade botocore Jan 5, 2024
@thrau thrau merged commit 41b4d23 into master Jan 5, 2024
34 checks passed
@thrau thrau deleted the remove-sqs-query-service branch January 5, 2024 20:06
@thrau
Copy link
Member

thrau commented Jan 5, 2024

merging the PR to get some data over the weekend. will keep an eye on it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants