
Slow start-up? #263

Open
a-h opened this issue Jul 24, 2022 · 25 comments

@a-h
Contributor

a-h commented Jul 24, 2022

I tried out migrating away from AWS's X-Ray SDK for Lambda, but the Open Telemetry Lambda layer appears to add a significant amount to cold start time, which I didn't expect.

It was suggested I cross post, since this is actually the repo with the layers in.

aws-observability/aws-otel-lambda#228 (comment)

Here's the data for reference:

[Screenshot 2022-07-24: cold start timing data]

I don't see any documentation on performance, comparison to X-Ray performance etc.

Is there a plan to reduce this? I didn't expect to have to add a Lambda layer to get OpenTelemetry working; I thought it would be included in the Lambda runtime as a first-class thing, rather than being a sort-of add-on.

@mhausenblas
Member

Hi, ADOT PM here. Thanks a lot; we're already in the process of diving deep on this, and I will report back once we have some shareable data (ETA: early August 2022).

@adambartholomew

Is there an expected startup time for the ADOT collector and instrumentation? Running the latest nodejs layer (1.6.0:1), I am still seeing 2+ second startups. Tested with both 1024MB and 1536MB memory.

Same test requests go from 3 seconds to 100ms after warming up:
[screenshot]

Sample initialization:
[screenshot]

@mhausenblas
Member

@adambartholomew we've identified the issue with cold starts and are considering ways to address it. Thanks for sharing your data points; we currently do not publish expected startup times.

@sam-goodwin

Any update on this? Is there an ETA for a fix? Can we expect a solution that is comparable to native EMF?

@adambiggs

Also eager to hear any updates. Are there any workarounds in the meantime?

@Sutty100

Sutty100 commented Dec 21, 2022

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows that cold starts are largely removed as a factor when using SnapStart.

@a-h
Contributor Author

a-h commented Dec 21, 2022

@Sutty100 - SnapStart is Java only, and specifically only Java 11, so it doesn't solve anything for most people using Lambda.

@RichiCoder1

RichiCoder1 commented Dec 21, 2022

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows that cold starts are largely removed as a factor when using SnapStart.

I believe that even if/when it's expanded, it doesn't currently support or address Lambda Extensions, so it wouldn't benefit this issue right now either.

@bilalq

bilalq commented Jan 29, 2023

Related issue in aws-otel-lambda repo: aws-observability/aws-otel-lambda#228

@RichiCoder1

@mhausenblas not to be too noisy, but is there any update on this? Or a plan to provide an update? This makes using the OTel layer close to a no-go for a number of latency-sensitive cases.

@mhausenblas
Member

@RichiCoder1 no problem at all; yes, we're working on it and should be able to share details soon. Overall, our plan is to address the issues in Q1; what we need to verify is to what extent.

@tsloughter
Member

To give some feedback: this is believed to be due to auto-instrumentation, so you may be able to improve your startup now by building your own, narrower layer.

@a-h
Contributor Author

a-h commented Mar 3, 2023

@tsloughter - do you mean "the 200ms cold start time is caused by auto-instrumentation"?

I can't see how that could be the case, since https://opentelemetry.io/docs/instrumentation/go/libraries/ says:

Go does not support truly automatic instrumentation like other languages today.

And the Lambda layer is written in Go.

@tsloughter
Member

@a-h ah, I didn't see any mention of the language in use. You are right; in Go there is no auto-instrumentation.

@disfluxly

Hey @mhausenblas - any updates on the timeline for this by chance?

@sangalli

I was trying to use ADOT with Lambda for Node.js + NestJS, but the auto-instrumentation performed by ADOT was adding seconds to the cold start time. @mhausenblas, please let us know if you have any updates on the timeline for this issue.
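
As a point of reference, a minimal sketch of trimming the auto-instrumentation set in a hand-rolled setup instead of taking the full default layer; the @opentelemetry/* package names are the public ones, but which instrumentations to disable is purely illustrative and depends on what the function actually uses:

```ts
// Sketch only: a narrowed Node.js tracing setup. Every instrumentation
// module loaded during init adds to cold start, so keep only what you use.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Illustrative: disable instrumentations this function does not need.
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-dns': { enabled: false },
      '@opentelemetry/instrumentation-net': { enabled: false },
    }),
  ],
});

sdk.start();
```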

@ithompson-gp

ithompson-gp commented Jul 1, 2023

Hi, in our tests we are seeing slow invocation start-up due to Collector extension registration (~800-2000 ms), plus latency on emit (POST) of telemetry from the function invocation to the Collector extension (~200-450 ms).

The initialisation duration will, of course, drop on subsequent invocations, but the POST latency (the ~200 ms) will remain for all invocations.

[Screenshot 2023-07-01: invocation timing]

Is there any news/update on remedies for this, @mhausenblas?

(Is there any suggestion from AWS on the best course of action here with Lambda? Is emitting via the OTel SDK [no local agent] to a central Collector seen as the better way forward?)

Thanks
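
On the last question, a minimal sketch of what emitting via the OTel SDK directly to a central Collector (no local agent/extension) could look like; the endpoint is hypothetical, and the trade-off is that the export then happens over the network from inside the invocation:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

// No Lambda extension: spans are sent straight to a central Collector.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // Hypothetical central Collector endpoint (OTLP over HTTP/protobuf).
    url: 'https://collector.example.com:4318/v1/traces',
  }),
});

sdk.start();
```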

@rapphil
Contributor

rapphil commented Jul 6, 2023

Hi, how are you measuring the latency for the subsequent invocations after the initialization? Is POST the HTTP verb, or is it something else?

Since you have a test setup in place, what is the latency when you don't use a layer?

@a-h
Contributor Author

a-h commented Oct 31, 2023

Hi @rapphil, I didn't see your message in July. On the screenshot, there's a red line. Above the line is when I added the OTel layer, and the cold start increased from around 100ms to 300ms.

@silpamittapalli

Hi there, we use Lambda serverless workloads in Financial Services with tight execution-time SLAs, which makes the overhead caused by introducing AWS ADOT or a custom extension layer for the OTel SDK or OTel Collector unacceptable. We are trying to cut down the cold start time by minimizing layers and using just the SDK without the collector, but it looks like we won't be able to reduce the overhead to an acceptable level.

If others have run into similar challenges, I'd be interested in learning how you are able to work around this and still collect distributed traces for such workloads. Thanks.

@sam-goodwin

@silpamittapalli the Baselime folks have already tried to strip this down as much as possible and package it as two dependencies.

It still has dependencies on these libraries, though, but it's worth looking at: https://github.com/baselime/node-opentelemetry/blob/b3331d5040bf35ca633c3634c186a2a5304a201d/package.json#L61-L68

I think a full rewrite is in order. It should be a concise JS library optimized for ESM bundling.

@Ankcorn

Ankcorn commented Apr 2, 2024

Thanks for the shout-out @sam-goodwin

We can make it smaller, but we opted not to make changes we knew we could not upstream, in order to keep things maintainable. As it stands, our OTel setup, including the extension (so we have zero runtime latency overhead), adds around 180ms of cold start.

I think it's possible to get sub-100ms cold starts whilst still being based on OpenTelemetry.

There are a few dependencies that could be patched or cut without changing behaviour much for most use cases, and I'm sure some other bits could be slimmed down a bit.

@silpamittapalli if you want to chat through your use case I'd be happy to help with this :)

On doing a complete optimized rewrite: it's easy to underestimate how much work has gone into OTel and how much it provides. It is a general solution, though, so it's not optimized for Lambda or other environments that prioritize a quick startup.

Here is our bundle; it's easy to see how much we have done versus how much we rely on the work of the OpenTelemetry team.

[screenshot: bundle contents]

There are some quick wins in there: semver could be replaced with something purpose-built and just a few kB, and semantic attributes could be tree-shaken better. I suspect resources and the resource detectors could be improved too, but the rest will be quite hard.

@bhaskarbanerjee

Has any memory profiling been done for this Lambda layer? Any recommendation for Node v12, v16, v18 or v20? How much memory overhead does using this layer add?

@silpamittapalli

Thank you @sam-goodwin @Ankcorn. @bhaskarbanerjee from my team tried out Baselime, but we haven't had any success with it yet, probably because it is customized for their proprietary software. We are trying a few other approaches: 1) manual instrumentation to eliminate layers altogether, and 2) minimizing the SDK and/or layer by stripping unused code/dependencies.
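
A minimal sketch of approach 1 (manual instrumentation, no layer), assuming the 1.x OpenTelemetry JS trace SDK is bundled with the function and a hypothetical Collector endpoint; the forceFlush before returning is there so spans go out before the execution environment is frozen:

```ts
import { trace } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

// Manual setup: no layer, no auto-instrumentation, only the spans we create.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'https://collector.example.com:4318/v1/traces' }) // hypothetical endpoint
  )
);
provider.register();

const tracer = trace.getTracer('orders-function'); // illustrative tracer name

export const handler = async (event: unknown) => {
  const span = tracer.startSpan('handle-request');
  try {
    // ... business logic ...
    return { statusCode: 200 };
  } finally {
    span.end();
    await provider.forceFlush(); // export before the environment is frozen
  }
};
```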

@bhaskarbanerjee

Has anyone here used the protobuf/HTTP exporter and compared its performance with that of the gRPC exporter, both for Lambda cold start time and response time?

Ref https://github.com/open-telemetry/opentelemetry-lambda/blob/main/nodejs/packages/layer/src/wrapper.ts#L24
Using import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto' seems to be very fast, but import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc' seems to take at least 100ms more. Seeking your advice.
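
One rough way to compare the two on cold start is to time the module load itself, which is usually where the difference shows up (the gRPC exporter also pulls in @grpc/grpc-js); the harness below is illustrative only:

```ts
// Illustrative ESM snippet: measure how long each exporter module takes to load.
const modules = [
  '@opentelemetry/exporter-trace-otlp-proto',
  '@opentelemetry/exporter-trace-otlp-grpc',
];

for (const name of modules) {
  const start = performance.now();
  await import(name); // top-level await, so run this as an ES module
  console.log(`${name} loaded in ${(performance.now() - start).toFixed(1)} ms`);
}
```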
