
MSK Rolling Upgrade Continuously Retries if Partition Count > MSK Limit #17332

Open
james-bjss opened this issue Jan 28, 2021 · 7 comments
Labels

enhancement: Requests to existing resources that expand the functionality or scope.
service/kafka: Issues and PRs that pertain to the kafka service.
upstream: Addresses functionality related to the cloud provider.

Comments

james-bjss (Contributor) commented Jan 28, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

terraform -v
Terraform v0.14.5
+ provider registry.terraform.io/hashicorp/aws v3.20.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

  • aws_msk_cluster

Terraform Configuration Files

resource "aws_msk_cluster" "example" {
  cluster_name           = "example"
  kafka_version          = "2.4.1" # After creating more partitions than the Upgrade limit change to 2.5.1 and reapply
  number_of_broker_nodes = 3
  ...

Debug Output

Gist with relevant logs

Expected Behavior

The apply should fail early, indicating that the upgrade can't be performed due to the high partition count.

Actual Behavior

The PUT call to /v1/clusters/clusterArn/version fails with an HTTP 429 and X-Amzn-Errortype: HighPartitionCountException.
TF output reports that it is retrying (x25).

Steps to Reproduce

  1. Deploy MSK Cluster with kafka_version="2.4.1.1" via TF
  2. Create topics and partitions exceeding the upgrade limits on the brokers (see the limits on the Upgrade endpoint); a rough sketch of doing this programmatically follows this list
  3. Update kafka_version to 2.5.1 and apply to trigger upgrade
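
For step 2, a rough sketch of creating a high-partition topic with the sarama Kafka client. The broker address, topic name, and partition count are placeholders; pick a count that pushes the per-broker total over the limit documented on the MSK limits page:

// Sketch only: create a topic with enough partitions to exceed the
// per-broker partition limit for in-place upgrades. All values below are
// placeholders, not taken from the original report.
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	brokers := []string{"b-1.example.kafka.us-east-1.amazonaws.com:9092"} // placeholder bootstrap broker

	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_4_0_0

	admin, err := sarama.NewClusterAdmin(brokers, cfg)
	if err != nil {
		log.Fatalf("creating cluster admin: %v", err)
	}
	defer admin.Close()

	// 3000 partitions with replication factor 3 across 3 brokers is intended
	// to exceed the per-broker limit; check the MSK limits page for the exact
	// number that applies to your broker size.
	err = admin.CreateTopic("upgrade-limit-test", &sarama.TopicDetail{
		NumPartitions:     3000,
		ReplicationFactor: 3,
	}, false)
	if err != nil {
		log.Fatalf("creating topic: %v", err)
	}
}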

Important Factoids

  • Debug output shows that the call to /v1/clusters/clusterArn/version returns an HTTP 429 (Too Many Requests)
  • The endpoint responds with X-Amzn-Errortype: HighPartitionCountException; however, I am not sure 429 is the correct status code in this instance, so this could be an issue on the AWS API side.
  • Because of the HTTP 429 code, TF continuously retries the call

There may be an argument that it should retry in case the partition count drops, but I would rather the apply fail early with an indication of the actual error. In theory TF is honoring the 429 response by retrying, but should it?
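
As a sketch of what failing early could look like, here is a hypothetical helper (not actual provider code) that detects this specific combination via the aws-sdk-go awserr types:

// Hypothetical helper, not actual provider code: treat the MSK
// HighPartitionCountException as non-retryable even though it is
// delivered as an HTTP 429.
package main

import (
	"errors"
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isHighPartitionCountError reports whether err is the 429 +
// HighPartitionCountException combination seen in the debug output.
func isHighPartitionCountError(err error) bool {
	var reqErr awserr.RequestFailure
	if errors.As(err, &reqErr) {
		return reqErr.StatusCode() == 429 &&
			reqErr.Code() == "HighPartitionCountException"
	}
	return false
}

func main() {
	// Simulate the error the version-update call gets back; the provider's
	// update logic could check for this and fail the apply immediately
	// instead of retrying 25 times.
	err := awserr.NewRequestFailure(
		awserr.New("HighPartitionCountException", "too many partitions to upgrade", nil),
		429, "example-request-id")
	fmt.Println(isHighPartitionCountError(err)) // true
}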

References

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#bestpractices-right-size-cluster
https://docs.aws.amazon.com/msk/1.0/apireference/clusters-clusterarn-version.html#clusters-clusterarn-versionput
https://docs.aws.amazon.com/msk/latest/developerguide/limits.html

@ghost ghost added the service/kafka Issues and PRs that pertain to the kafka service. label Jan 28, 2021
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jan 28, 2021
Farzad-Jalali commented

I got the exact same problem!

james-bjss (Contributor, Author) commented Jan 30, 2021

I have also raised this with AWS support, who have escalated it to the MSK team to confirm whether the 429 response is expected behavior.

james-bjss (Contributor, Author) commented

Update on the above: the AWS MSK team is reviewing the 429 response code and may remediate this, but no dates have been given.

@breathingdust breathingdust added upstream Addresses functionality related to the cloud provider. enhancement Requests to existing resources that expand the functionality or scope. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 16, 2021
marcincuber commented

@james-bjss any updates on this?

james-bjss (Contributor, Author) commented Oct 15, 2021

> @james-bjss any updates on this?

Hi @marcincuber - Unfortunately I never got a response back from AWS support on this. It was passed on to the MSK team and the ticket was closed. In theory this could be handled in the provider by checking for the specific header it returns, but I'm not sure the team would want to put that workaround in code.

Have you had this issue recently? I haven't retested, so it's entirely possible it has been resolved upstream.

marcincuber commented

@james-bjss I haven't tested it. However, I will be starting work on Kafka this week. This is an interesting issue that you mentioned here, so I will definitely check whether I can reproduce it.

Pekinek commented Nov 3, 2023

I also contacted AWS support about this issue and changing 429 to something else was added to their backlog - no ETA though.

"Thank you for providing the change request. I have added this to the backlog and it will be prioritized accordingly."
