
Artificial limiting of Lambda Function creation #2206

Closed
jaxxstorm opened this issue Nov 9, 2022 · 12 comments
Assignees
Labels
customer/feedback (Feedback from customers), customer/lighthouse (Lighthouse customer bugs), impact/performance (Something is slower than expected), kind/enhancement (Improvements or new features), resolution/fixed (This issue was fixed)

Comments

@jaxxstorm
Contributor

jaxxstorm commented Nov 9, 2022

What happened?

The underlying bridged provider artificially limits the number of lambda functions that can be uploaded in parallel.

The cause of this is hashicorp/terraform#9364, and the fix was implemented upstream in the PR below: a mutex.Lock was added around the lambda function upload, which serializes the uploads and slows down the provisioning of lambdas. You can see the change here:

https://github.com/hashicorp/terraform/pull/9667/files
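As a toy illustration only (the real lock lives in the bridged Go provider, not in user code), a minimal Python sketch of why a single shared lock turns parallel uploads into a serial queue:

import threading
import time

upload_lock = threading.Lock()  # stands in for the provider's shared mutex

def upload_function(i: int) -> None:
    # Every worker waits on the same lock, so the "uploads" run one after another.
    with upload_lock:
        time.sleep(1)  # simulate a 1-second upload
        print(f"uploaded function {i}")

threads = [threading.Thread(target=upload_function, args=(i,)) for i in range(5)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total: {time.time() - start:.1f}s")  # ~5s with the lock, ~1s without it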

Steps to reproduce

import pulumi
import pulumi_aws as aws

for i in range(50):
    name = f'pulumi-{str(i).rjust(2, "0")}'
    aws.lambda_.Function(
        name,
        aws.lambda_.FunctionArgs(
            name=name,
            code=pulumi.FileArchive("hello-world.zip"),
            role="arn:aws:iam::616138583583:role/lbriggs-lambda-test",
            handler="hello-world.handler",
            runtime="nodejs12.x",
        ),
    )

Expected Behavior

It creates the functions in parallel

Actual Behavior

This takes around 4-5 minutes, and you can observe the creations slowing down and running serially.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@jaxxstorm added the kind/bug (Some behavior is incorrect or out of spec) and needs-triage (Needs attention from the triage team) labels Nov 9, 2022
@lblackstone added the impact/performance (Something is slower than expected) and awaiting-upstream (The issue cannot be resolved without action in another repository; may be owned by Pulumi) labels and removed the needs-triage label Nov 11, 2022
@lukehoban
Member

Note that this issue is tracking a decision on whether to diverge from upstream on this behaviour - so it is not "awaiting-upstream".

The current locking incurs a very significant performance penalty in cases like the one above, and provides no way to work around it. The original issue that this locking was introduced to fix is much rarer, and is more within the control of the user, who can apply --parallel to limit the parallelism (and thus the concurrent memory usage).

@lukehoban removed the awaiting-upstream label Nov 13, 2022
@lukehoban added the kind/enhancement (Improvements or new features) label and removed the kind/bug label Nov 21, 2022
@mnlumi added the customer/feedback (Feedback from customers) and customer/lighthouse (Lighthouse customer bugs) labels Oct 3, 2023
@mikhailshilkov
Member

mikhailshilkov commented Mar 27, 2024

We may want to prioritize a decision here. #3740 reported the same issue; however, with SnapStart+Publish settings the code updates take much longer (minutes instead of seconds), so the whole UX gets much worse.

Note that the issue applies to Updates as well as Creates.

@flostadler
Contributor

I'm leaning towards taking a different decision than upstream and consequently removing the serialization for lambda upload.

The underlying issue that was fixed upstream (hashicorp/terraform#9364) only surfaces when dealing with many (10+) Lambdas in a memory-constrained environment. Additionally, that fix is ~8 years old, so it's worth checking whether the issue is still reproducible; for example, the Lambda resource upstream has since been switched to the AWS SDK v2.

Apart from that, there are two user-facing workarounds that I can think of:

  1. The --parallel flag @lukehoban mentioned to control the number of parallel resource operations
  2. Uploading the lambda code to S3 using aws.s3.BucketObjectv2. This is the mandatory approach for Lambdas larger than 50MB. Instead of loading the whole blob into memory, it streams the content to S3 (a sketch follows below).
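A minimal sketch of workaround 2, assuming a hypothetical bucket and reusing the role, handler, and runtime from the repro above (all names and ARNs are placeholders):

import pulumi
import pulumi_aws as aws

# Hypothetical artifact bucket; any existing bucket works as well.
bucket = aws.s3.BucketV2("lambda-artifacts")

code_object = aws.s3.BucketObjectv2(
    "hello-world-code",
    bucket=bucket.id,
    key="hello-world.zip",
    source=pulumi.FileAsset("hello-world.zip"),  # uploaded to S3 instead of base64-embedded in the Lambda API call
)

fn = aws.lambda_.Function(
    "pulumi-s3-example",
    s3_bucket=bucket.id,
    s3_key=code_object.key,
    role="arn:aws:iam::616138583583:role/lbriggs-lambda-test",
    handler="hello-world.handler",
    runtime="nodejs12.x",
)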

I'm going to create a prototype that removes the artificial serialization and will test the mentioned workarounds for feasibility.

@flostadler
Contributor

flostadler commented May 23, 2024

After removing the artificial serialization I was able to verify that the underlying issue still exists and both workarounds mentioned above are viable.

This prototype PR includes the patch that removes the serialization as well as the necessary Pulumi programs to replicate the issue and test workarounds: #3976

I'm testing with 25MB artifacts, as that's half of the maximum size (50MB) for direct uploads. The tests create 50 Lambdas with those artifacts.

Findings

  • With default settings (--parallel 48) the aws provider peaks at ~28GB virtual memory usage (~3GB RSS, indicating heavy swapping) and fails after ~20min because the requests to AWS time out (Sigv4 signature expired).
  • When limiting allowed parallel resource operations to 8 everything worked, but the memory usage of the AWS provider peaked at 2.8GB. pulumi up took ~15min.
  • When using the workaround of uploading the lambda.zip to S3 there were no limits for parallel resource operations and pulumi up took ~4.5min. The memory usage of the AWS provider stayed below 300MB throughout the operation.

For reference, the baseline without the fix took 26 minutes to complete pulumi up.

What's interesting is the memory usage when directly uploading the lambda code. 50 lambdas, 25MB each, add up to 1.25GB, which is only roughly 5% of the observed memory usage. The same could be seen in the original upstream bug report: hashicorp/terraform#9364.
I'm gonna have a look into this as there might be some unnecessary memory allocations/copying going on.

@flostadler
Contributor

flostadler commented May 23, 2024

After profiling the provider with pprof I found two main places where memory is allocated.

awsRestjson1_serializeDocumentFunctionCode

This converts the zip archive to a base64 encoded string: https://github.com/aws/aws-sdk-go-v2/blob/main/service/lambda/serializers.go#L6178

In my test run with 50 lambdas à 25MB (1.25GB total) this allocated 3.49GB. That sounds reasonable (a rough check follows the list below), given that this needs to:

  • copy the binary slice
  • convert it and store a new base64 string
  • create the request body with that base64 string
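As a rough, illustrative check of that overhead (plain Python, not the SDK's Go serializer):

import base64
import os

payload = os.urandom(25 * 1024 * 1024)        # stand-in for a 25MB lambda archive
encoded = base64.b64encode(payload)           # base64 is ~4/3 the size of the input
body = b'{"ZipFile":"' + encoded + b'"}'      # rough stand-in for the JSON request body

print(f"{len(payload) / 2**20:.1f} MB raw")
print(f"{len(encoded) / 2**20:.1f} MB base64")
print(f"{len(body) / 2**20:.1f} MB request body")

# Roughly 25 + 33 + 33 ≈ 90MB of allocations per function, which for 50 functions
# is the same order of magnitude as the ~3.5GB observed above.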

defaultRequestBodyLogger::Log

This is the upstream provider's HTTP logging middleware for the AWS SDK: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logging/http.go#L118

It is used to emit debug logs with the request body here: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logger.go#L140

Under the hood this creates multiple copies of the request body, which explains the high memory usage.

Even though this log message is only emitted when debug logging is turned on, the code for preparing the message is executed regardless of the log level.
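To illustrate the same anti-pattern outside of Go (a generic Python sketch, not the provider's actual middleware): the argument passed to the logger is computed before the log level is ever checked, so the expensive copies happen even when debug logging is off.

import base64
import logging

logger = logging.getLogger("http")

def log_request_eager(body: bytes) -> None:
    # The base64 copy is built unconditionally; the logger only discards the
    # finished string afterwards if DEBUG is disabled.
    logger.debug("request body: %s", base64.b64encode(body).decode())

def log_request_lazy(body: bytes) -> None:
    # Guarding on the log level avoids the copies entirely at INFO and above.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("request body: %s", base64.b64encode(body).decode())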

This is my main suspicion right now. I'm gonna test whether removing the lambda binary from the log fixes the problem.

@flostadler
Contributor

flostadler commented May 24, 2024

After removing the request logging for Lambda the issue is fixed!

To compare the effects of removing the artificial limiting I've executed a set of tests on a c6i.2xlarge EC2 instance (each test was executed 5 times, measurements are the max). The EC2 instance does not have swap enabled (by default):

1. Baseline with artificial limiting (pulumi-aws 6.37.0). Default --parallel of 32

pulumi up takes ~8.5 minutes and memory peaks at ~0.9GB RSS

2. Without artificial limiting (d815227). Default --parallel of 32

pulumi up takes ~2 minutes and memory peaks at ~7.1GB RSS
This is very interesting compared to the 28GB memory usage on my M3 MacBook mentioned before.

My current working theory is that this is driven by the lower network bandwidth locally, in combination with retries.
The Sigv4 signature of an AWS API call expires if the call takes longer than 5 minutes. When that happens the SDK retries the call, and the HTTP request logger copies the body again and logs it, leading to increased memory usage.

3. Without artificial limiting (d815227). --parallel of 48, bandwidth limited to 25Mbit/s up and 50Mbit/s down

pulumi up fails after ~19 minutes and consumes 14.8GB RSS (93% of the instance's memory).
The errors are Sigv4 signature expired. This is in line with the behavior locally on my M3 Mac. It shows that the memory issues around HTTP request logging get amplified through retries and proves the theory in 2).

4. Without artificial limiting and Lambda HTTP body logging turned off (0c4044c). Default --parallel of 32

pulumi up takes ~1:30 minutes and memory peaks at 4.1GB RSS

5. Without artificial limiting and Lambda HTTP body logging turned off (0c4044c). --parallel 8

pulumi up takes ~1:50 minutes and memory peaks at 1.2GB RSS

@flostadler
Contributor

flostadler commented May 24, 2024

So to summarize the findings:

  • The high memory usage during Lambda creation/updates was indeed caused by logging the request bodies as this involves multiple copies & memory allocations.
  • Removing the artificial serialization of Lambda creation/updates does not lead to unusual memory consumption when the request body logging is removed. The memory usage is in line with what's expected to assemble the AWS API requests for directly uploading Lambda code (converting archive to base64 and creating request body out of that).
  • Using --parallel or uploading the Lambda Code to S3 gives users control over the memory usage

Based on those findings I'm proposing to fix the HTTP request logger and remove the artificial limiting of Lambda creation/updates. This should result in a drastic speed-up for users with many lambdas. My tests have shown a 5.5x speed-up for directly uploading 25MB lambda archives. Users with smaller archives, or users who upload their archives to S3, should see even bigger speed-ups.

@flostadler
Contributor

I was able to create a patch that fixes the memory issues: #3976

I'm drafting upstream PRs right now, but I'm proposing that we still go forward with the patch for the time being, because this change will touch two different repos upstream (hashicorp/aws-sdk-go-base & hashicorp/terraform-provider-aws).
My hunch is that this will take a substantial amount of time to get merged (even small patches for panics take weeks). And in case it goes faster, we can remove the patch again.

@t0yv0 @mjeffryes @corymhall What do you think about that approach?

@t0yv0
Member

t0yv0 commented May 31, 2024

Great work! I'd patch and release first and then propose the upstream PRs - any reason to keep our users waiting? One reason would be if there is uncertainty or risk in the patch. Looking forward to reviewing this next week.

@flostadler
Contributor

I'd say the main risk with the patch is the maintenance effort. Given that the patch is fairly isolated (it only touches the AWS SDK middleware), I'd not keep the users waiting any longer.

@t0yv0
Member

t0yv0 commented Jun 4, 2024

I'd say the main risk with the patch is the maintenance effort.

If that becomes a problem we can revert, right? I'm for merging.

flostadler added a commit that referenced this issue Jun 7, 2024
The root cause behind why the serialization of Lambda creation/update
was introduced upstream is excessive memory usage (see
hashicorp/terraform#9364).
After investigation we found that this is caused by the HTTP request
logging middleware. It logs the lambda archive as a base64 encoded
string. In order to do so, multiple copies of the body are created in
memory, which leads to memory bloating.
This change fixes that by redacting the body in the logs for the
Create/Update Lambda calls.

The PR introduces two patches. One removes the Lambda serialization and
the other fixes the HTTP request logging middleware for the Lambda
`CreateFunction` and `UpdateFunctionCode` operations.
After this, Lambdas are created/updated in parallel and don't suffer
from excessive memory usage. Users can still limit the parallelism with
the CLI flag `--parallel` if they wish.

Relates to #2206
@flostadler
Contributor

This was fixed and released as part of v6.39.0.

@flostadler added the resolution/fixed (This issue was fixed) label Jun 11, 2024