aws_api_gateway_deployment + Lambda Race Condition HTTP500, 5s pause workaround #17604

Open
namachieli opened this issue Feb 12, 2021 · 2 comments
Labels
bug Addresses a defect in current functionality. service/apigateway Issues and PRs that pertain to the apigateway service.

Comments

@namachieli

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

$ terraform -v
Terraform v0.14.6
+ provider registry.terraform.io/hashicorp/aws v3.27.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/time v0.6.0
+ provider registry.terraform.io/integrations/github v4.4.0

Affected Resource(s)

  • aws_api_gateway_deployment

Terraform Configuration Files

Since funkiness with aws_api_gateway_deployment is well known, I'm just adding the relevant bit.

resource "aws_api_gateway_integration" "POST" {
  cache_key_parameters    = []
  connection_type         = "INTERNET"
  content_handling        = "CONVERT_TO_TEXT"
  http_method             = "POST"
  integration_http_method = "POST"
  passthrough_behavior    = "WHEN_NO_TEMPLATES"
  request_parameters      = {}
  resource_id             = aws_api_gateway_resource.test.id
  rest_api_id             = aws_api_gateway_rest_api.test.id
  timeout_milliseconds    = 29000
  type                    = "AWS"
  uri                     = aws_lambda_function.test.invoke_arn
  request_templates = {
    "application/json" = <<-EOT
     <...>
    EOT
  }
}

<...>

resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id
  triggers = {
    redeployment = sha1(jsonencode([
      aws_api_gateway_resource.test.id,
      aws_api_gateway_method.POST.id,
      aws_api_gateway_method.OPTIONS.id,
      aws_api_gateway_integration.POST.id,
      aws_api_gateway_integration.OPTIONS.id
    ]))
  }
  lifecycle {
    create_before_destroy = true
  }
  depends_on = [
    aws_api_gateway_integration.POST,
    aws_api_gateway_integration.OPTIONS,
    aws_api_gateway_method.POST,
    aws_api_gateway_method.OPTIONS,
    aws_api_gateway_integration_response.POST-200,
    aws_api_gateway_integration_response.OPTIONS-200,
  ]
}

resource "aws_api_gateway_stage" "api" {
  cache_cluster_enabled = false
  deployment_id         = aws_api_gateway_deployment.api.id
  rest_api_id           = aws_api_gateway_rest_api.test.id
  stage_name            = "api"
  xray_tracing_enabled  = false
}

resource "aws_api_gateway_method_settings" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id
  stage_name  = aws_api_gateway_stage.api.stage_name
  method_path = "*/*"
  settings {
    throttling_burst_limit = 5000
    throttling_rate_limit  = 10000
    metrics_enabled        = true
  }
}

Expected Behavior

API Gateway should deploy the stage, and a request to the invoke URL should trigger the backend Lambda and return an HTTP 200 to the client.

Output from workaround

body='{"id":"80...22","token":"aW5...kc5","type":1,"user":{"avatar":"ea...b6","discriminator":"2551","id":"24...93","public_flags":0,"username":"Na..."},"version":1}'

edsig='322...40f'
ts='161...'
invoke_url='https://7...0.execute-api.us-west-2.amazonaws.com/{stage}/{resource}'

curl -i -X POST \
> -H 'accept: */*' \
> -H "Content-Type: application/json" \
> -H "x-signature-ed25519: ${edsig}" \
> -H "x-signature-timestamp: ${ts}" \
> -d ${body} ${invoke_url}
HTTP/2 200
date: Fri, 12 Feb 2021 21:10:38 GMT
content-type: application/json
content-length: 11
x-amzn-requestid: e47...403
x-amz-apigw-id: app...A=
x-amzn-trace-id: Root=1-6...1b;Sampled=0

{"type": 1}

Actual Behavior

A request to the stage's invoke URL passes the request body to the Lambda, and the Lambda processes it correctly (as evidenced by the CloudWatch logs and the Lambda output). However, the invoking client receives an HTTP 500.

Output before workaround

body='{"id":"80...22","token":"aW5...kc5","type":1,"user":{"avatar":"ea...b6","discriminator":"2551","id":"24...93","public_flags":0,"username":"Na..."},"version":1}'

edsig='322...40f'
ts='161...'
invoke_url='https://7...0.execute-api.us-west-2.amazonaws.com/{stage}/{resource}'

curl -i -X POST \
> -H 'accept: */*' \
> -H "Content-Type: application/json" \
> -H "x-signature-ed25519: ${edsig}" \
> -H "x-signature-timestamp: ${ts}" \
> -d ${body} ${invoke_url}
HTTP/2 500
date: Fri, 12 Feb 2021 21:10:36 GMT
content-type: application/json
content-length: 36
x-amzn-requestid: be...992
x-amzn-errortype: InternalServerErrorException
x-amz-apigw-id: app...2Q=

{"message": "Internal server error"}

Steps to Reproduce

  • Apply a fairly standard Lambda/API Gateway config to a clean environment.
  • Use aws_api_gateway_deployment and aws_api_gateway_stage to deploy the API stage.
  • Attempt to trigger the Lambda with curl; receive HTTP 500.
  • In the AWS web console, deploy the same stage again with no other changes.
  • Attempt to trigger the Lambda with curl; receive HTTP 200.

You can easily switch the stage between the Terraform-created deployment and the manual deployment under API > Stages > Deployment History.

Why is this a race condition?

The problem isn't what Terraform attempts to create, it's WHEN it attempts to create it. Manually deploying after terraform apply does the same thing Terraform did, except that by then every resource has been fully built and linked internally within AWS.

A workaround is to simply add:

resource "time_sleep" "wait" {
  create_duration = "5s"
  depends_on = [
    aws_api_gateway_integration.xxx,
    aws_api_gateway_method.xxx,
    aws_api_gateway_integration_response.xxx,
  ]
}

resource "aws_api_gateway_deployment" "api" {
<...>
  depends_on = [
    time_sleep.wait
  ]
}

This 5s pause allows something on the AWS backend to finish being created in time for the deployment to build correctly. There are likely other background issues contributing to this, but it's easy to call it a race condition since it's solvable with a pause.
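For reference, a fuller sketch of the workaround might look like the following. It assumes the xxx placeholders stand for the POST and OPTIONS resources from the configuration above. It also adds one thing that is not part of the original workaround: a triggers map on time_sleep (an argument of the hashicorp/time provider), so the pause runs again whenever the API definition changes rather than only on the first apply. This is an untested illustration, not a confirmed fix.

```hcl
# Sketch only: resource names are taken from the configuration above;
# the "triggers" map on time_sleep is an addition, not part of the
# original workaround.
resource "time_sleep" "wait" {
  create_duration = "5s"

  # Re-run the pause whenever any of these IDs change.
  triggers = {
    redeployment = sha1(jsonencode([
      aws_api_gateway_resource.test.id,
      aws_api_gateway_method.POST.id,
      aws_api_gateway_method.OPTIONS.id,
      aws_api_gateway_integration.POST.id,
      aws_api_gateway_integration.OPTIONS.id
    ]))
  }

  depends_on = [
    aws_api_gateway_integration.POST,
    aws_api_gateway_integration.OPTIONS,
    aws_api_gateway_method.POST,
    aws_api_gateway_method.OPTIONS,
    aws_api_gateway_integration_response.POST-200,
    aws_api_gateway_integration_response.OPTIONS-200,
  ]
}

resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id

  # Reading the hash through time_sleep means the deployment is only
  # replaced after the pause has completed.
  triggers = {
    redeployment = time_sleep.wait.triggers["redeployment"]
  }

  lifecycle {
    create_before_destroy = true
  }

  # Already implied by the reference above; kept explicit for clarity.
  depends_on = [
    time_sleep.wait
  ]
}
```

Without the triggers map, time_sleep only waits when it is first created, so later changes to the methods or integrations would redeploy without any pause.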

I also tried moving the dependency logic so that the deployment happens normally and the time delay gates the aws_api_gateway_stage resource instead. This always results in the race condition failure, so I strongly believe it's tied to aws_api_gateway_deployment.
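The variant that did not help was not posted, so the following is only a reconstruction from the description above. The name wait_before_stage is hypothetical, and the attribute values are copied from the configuration earlier in this issue.

```hcl
# Reconstruction (assumed, not the author's actual config) of the variant
# that still returned HTTP 500: the deployment is created right away and
# the pause only delays the stage.
resource "aws_api_gateway_deployment" "api" {
  rest_api_id = aws_api_gateway_rest_api.test.id

  lifecycle {
    create_before_destroy = true
  }

  depends_on = [
    aws_api_gateway_integration.POST,
    aws_api_gateway_integration.OPTIONS,
    aws_api_gateway_method.POST,
    aws_api_gateway_method.OPTIONS,
    aws_api_gateway_integration_response.POST-200,
    aws_api_gateway_integration_response.OPTIONS-200,
  ]
}

# Hypothetical name; waits 5s after the deployment exists.
resource "time_sleep" "wait_before_stage" {
  create_duration = "5s"

  depends_on = [
    aws_api_gateway_deployment.api
  ]
}

resource "aws_api_gateway_stage" "api" {
  deployment_id = aws_api_gateway_deployment.api.id
  rest_api_id   = aws_api_gateway_rest_api.test.id
  stage_name    = "api"

  depends_on = [
    time_sleep.wait_before_stage
  ]
}
```

If this ordering fails while the pause before the deployment succeeds, that is consistent with the deployment capturing the API's state at the moment it is created.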

References

This lambda and the effective TF config is based on the POC from https://oozio.medium.com/serverless-discord-bot-55f95f26f743.

I am willing to provide a sanitized complete TF if required.

@ghost ghost added the service/apigateway Issues and PRs that pertain to the apigateway service. label Feb 12, 2021
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Feb 12, 2021
@justinretzolk justinretzolk added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 13, 2021

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

@github-actions github-actions bot added the stale Old or inactive issues managed by automation, if no further action taken these will get closed. label Dec 26, 2023
@namachieli
Author

Unless this has been solved, I think this issue should stay open.

@github-actions github-actions bot removed the stale Old or inactive issues managed by automation, if no further action taken these will get closed. label Jan 25, 2024