New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix sqs multi-protocol handling and upgrade botocore #9710
Conversation
f01ddec
to
cb8f2cf
Compare
0ebd655
to
ab02b47
Compare
c092a3c
to
3b792e5
Compare
7619561
to
09a4c3f
Compare
LocalStack Community integration with Pro 2 files 2 suites 1h 17m 41s ⏱️ Results for commit 72d8615. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is quite the undertaking. Many thanks for tackling this! I will only comment on the sqs provider side of things, where I saw one thing I wanted to clarify (which happens with multiple tests, so my comments all address the same potential issue)
assert "Messages" in result | ||
assert result["Messages"] == [] | ||
assert "Messages" not in result or result["Messages"] == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this isn't the most pressing issue here, just as a remark: as the test is aws_validated, it seems odd that "Messages" either doesn't exist or is an empty list. So I guess this differs between the two protocols? In that case, is there a reasonably easy way to check according to which protocol is tested right now? Because right now both values are valid for both protocols. Or do/did we accept this as limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For empty lists, AWS often omits the field entirely, independent of the protocol. I noticed this behavior recently while parity testing different SDK versions (which use different protocols) in the Lambda multiruntime tests.
I moved my Notion testing page to the SQS service to demonstrate the behavior for the listQueues
operation (see here). It looks like we have an SQS parity issue when listing empty queues (unrelated to this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment! Since this is tested in dozens of the snapshot tests (specifically for each individual protocol due to the new fixture being used in the tests), I wanted to make these assertions independent of the protocol. Also because the next switch is around the corner (I'm quite sure the botocore
team will try another run on the switch somewhat soon).
Also, it seems like they are currently actually changing the behavior server-side (the json
service is not returning empty lists anymore). But a gain, I think this is something which should belong in the snapshot verification. Here we actually semantically want to check if there are no messages left (and not explicitly want to test the protocol).
I'll close the other comments concerning this, but I'm happy to discuss this. If we want to change this, then obviously I'll apply it everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is an extensive set of changes. Biggest takeaway the clean separation between the protocol and the service name, made possible with the nice changes in the service router!
We've also had a sync looking at a lot of test changes, and it looks great! I like the parametrization making sure we properly test both protocols.
There's a lot of clean up in here, and it seems #9828 has been merged, which would maybe need additional snapshots and parametrization as well.
All in all, LGTM! Thank you so much for addressing this and making a nice clean fix 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is already a huge improvement over what we had before! I do have a few open questions/comments that I'd like to understand before approving, but overall really great work!
def test_query_protocol_error_serialization(): | ||
exception = InvalidMessageContents("Exception message!") | ||
_botocore_error_serializer_integration_test( | ||
"sqs-query", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!" | ||
"sqs", "SendMessage", exception, "InvalidMessageContents", 400, "Exception message!" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is specifically a test for the query protocol, should we maybe use a service that we know is going to use an unaltered version of the query serializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQS is used a lot across the parser tests. In fact, the parser and serializer tests should be refactored and structured in the future, but I would like to avoid adding this to this PR, if it's okay for you.
@@ -275,28 +276,32 @@ def get_service_catalog() -> ServiceCatalog: | |||
return ServiceCatalog() | |||
|
|||
|
|||
def resolve_conflicts(candidates: Set[str], request: Request): | |||
def resolve_conflicts(candidates: Set[ServiceModel], request: Request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we still passing a Set[ServiceModel]
? is ServiceModel
even hashable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was hashable, yes, but it's a good point. The service router has now been restructured such that the different stages are working based on service model identifiers, and the main function loads and returns the model.
So in the latest revision, this specific parameter has been changed to a set of ServiceIdentifier
(a named tuple containing the service name and the protocol).
content_type = request.headers.get("Content-Type") | ||
return "sqs" if content_type == "application/x-amz-json-1.0" else "sqs-query" | ||
return "sqs-json" if content_type == "application/x-amz-json-1.0" else "sqs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i was under the impression we were getting rid of the virtual service names? shouldn't this then return the service model instead? if not, should we introduce some form of value type for the (service,version,protocol) tuple which should uniquely identify a service model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you input, this is how it's working now (we have a named tuple uniquely identifying a loadable service model), which is the return type of the conflict resolution / heuristic stage functions.
localstack/aws/spec.py
Outdated
@cached_property | ||
def target_prefix_index(self) -> Dict[str, List[ServiceName]]: | ||
def target_prefix_index(self) -> Dict[str, List[ServiceModel]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to now store ServiceModel
instances in the catalog index, which is the one we serialize and save to disk. How does this affect the size of the cache file and the performance of serializing/deserializing the cache?
I would be more inclined to keep cache structures to primitives types when they are stored. So something like the service,version,protocol
tuple i mentioned earlier could be useful for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good point, the model caching totally slipped my mind.
I changed the index such that it now only uses ServiceModelIdentifier
(a NamedTuple
).
This obviously increases the size of the service catalog pickle, but just by ~225kb instead of multiple megabytes: 383,9kb (old) compared to 608,6kb (new)
This results in the following building and loading times of the service catalog:
Stage | Previous Index | New Index |
---|---|---|
Building | 2.0 - 2.1s | 1.8 - 2.2s |
Loading | 14 - 15ms | 27 - 40ms |
However, the variance was quite high for my tests (on my notebook).
But we can see that the building doesn't really take considerably longer, the loading takes double the time, but is only executed once per LocalStack run and it's an impact of max. 25ms.
localstack/aws/spec.py
Outdated
# the service name needs to be set to standard sqs | ||
if service == "sqs-json": | ||
service = "sqs" | ||
return ServiceModel(service_description, service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be really good if we could get protocol: str
as an optional parameter into our load_service
implementation instead of relying on the fake service names. i understand that we currently need them under the hood to know which spec file to load, but basically our api should understand protocols as first-class citizen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. The new solution does not use the sqs-json
service anymore in the code.
However, using a custom service name has a big benefits: This way it integrates really nicely with botocore
and our internal AWS client factory. We can just use aws_client.sqs_json
to load a client which reads the spec we ship in the localstack module. This can be really helpful in the tests, and might also come in handy for other use cases. So we would like to have two ways to load a service with a specific protocol:
- Add the suffix to the service name (f.e.
sqs-json
), which enables the use case mentioned above. - Add the protocol to the service loading function, to have a clean and clear way to define a specific protocol of a service to load.
I also tried to avoid patching botocore
, but since the spec#load_service
is the only function using the custom loader, which is why this custom logic (implementing the protocol-specific loading) is contained in that function.
2d78884
to
6fa2171
Compare
f9b30e2
to
005a20a
Compare
01734d5
to
b14b789
Compare
b14b789
to
e151e88
Compare
e151e88
to
5afd969
Compare
5afd969
to
72d8615
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fantastic job @alexrashed! thanks a lot for your patience and persistence on this long standing issue! i think we found a good solution for now, and it's exciting to see how we're really pushing the boundaries of botocore and what we need from it.
well done!
merging the PR to get some data over the weekend. will keep an eye on it! |
Motivation
This PR aims at cleaning up after the quick fixes with the protocol switch in SQS of
botocore
fromquery
tojson
(and now back again).These changes of the protocol are highly problematic, because this means that clients are suddenly changing the way how they talk to AWS (and in turn also with LocalStack). There's nothing we can do on our side besides implementing the new protocol (in addition to the old protocol to remain downwards-compatible with other or older clients).
Here's a timeline of what happened:
botocore
switched the protocol for SQS fromquery
tojson
in version1.29.127
(specifically this commit: boto/botocore@2395012).botocore
reverts the change with Revert SQS protocol changes boto/botocore#2931 in version1.29.128
.botocore
again performs the switch in version1.31.81
(specifically this commit: boto/botocore@84c7847).botocore
changes the exception handling for SQS in version1.31.85
in Fix Sqs ErrorType not being parsed properly boto/botocore#3054.botocore
switches to gzipped specs which is not compatible with moto at the time.botocore
as a first response.moto
such that it contains Techdebt: Fix compatibility with botocore 1.32.1 getmoto/moto#7030.botocore
againbotocore
switches the protocol for SQS back fromjson
toquery
in version1.32.6
(specifically this PR: Revert SQS protocol changes boto/botocore#3071)botocore
as a first response.This is where we are at. With this PR we want to:
sqs-query
should only be used in very specific sections explicitly handling thequery
spec.botocore
, effectively switching back toquery
as the default protocol, but still supportjson
.Dockerfile
introduced with pin global awscli install in Dockerfile #9767.Changes
sqs-query
service "alias", and removes all special cases handling this service alias.serializer
andparser
such that they differentiate based on the protocol of the given service model (instead of only based on the service).service_router
such that it delivers a service model instead of only the service name (because that would be ambiguous, there are two service models for SQS).botocore
version (since they had to be hold back, see Update ASF APIs #9735).botocore
.json
back toquery
as default protocol for SQS.query
tojson
(the "non-default" protocol).serializer
adjustments which are necessary due to the duality of the protocols, their divergence, and the fact that we only generate a single ASF stub based on one of the specs.