Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document testing to reduce lambda cold start time #693

Merged
merged 1 commit into from
Jul 28, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
14 changes: 9 additions & 5 deletions docs/testing/00-notification-improvement-plan.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,11 @@ Current ideas and initiatives for the fine-tuning phase:
- [Increasing shard size](./01-shard-size.md)
- [RDS proxy](./02-rds-proxy.md)
- Aurora and postgres 14
- Query performance improvements
- ~~Handling DB connection errors more gracefully~~ parked for now since errors reduced after DB vacuum (including re-indexing)
- Lambda concurrency
- [Query performance improvements ](../architecture/02-database-tuning.md)
- ~~Handling DB connection errors more gracefully~~ parked for now since errors reduced after DB vacuum (including re-indexing)
- [Lambda concurrency](./04-reducing-lambda-duration.md)
- [Reducing lambda execution time](./06-reducing-lambda-cold-start-times.md)
- Change batch size for sender lambdas (increase throughput to apns/fcm)

## Product improvements

Expand All @@ -36,6 +38,8 @@ Ideas for potential product improvements that would likely require some integrat

Current ideas for re-architecting:

- [Microsplinters](https://github.com/itsibitzi/n10n-poc/blob/main/n10n-broker/src/main.rs): preshard the DB and use these for querying
- [Microsplinters](https://github.com/itsibitzi/n10n-poc/blob/main/n10n-broker/src/main.rs): preshard the DB and use these for querying
- Using Go to handle concurrency
- Using ZIO to handle concurrency
- Using ZIO to handle concurrency
- Change lambda runtime: there are [studies](https://filia-aleks.medium.com/aws-lambda-battle-2021-performance-comparison-for-all-languages-c1b441005fd1) demonstrating that jvm-based lambdas are the worst performing, we could get benefits from migrating to typescript lambdas (or go, rust)
- Pre-cache breaking news tokens: create a cron job that queries db to grab tokens for our breaking news topics to avoid these queries being executed as part of our send flow. Different flow for breaking news notifications: when requested we skip the harvester and immediately populate the sender queues with tokens to send.
128 changes: 128 additions & 0 deletions docs/testing/06-reducing-lambda-cold-start-times.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
# Reducing Lambda Cold Start Times

The hypothesis we tested was that by:
- minimising lambda dependencies, and
- reverting to deploying lambdas with a .zip archive, and
- using the latest aws sdk (v2), and
- compiling for the latest java version (java 11 corretto), and
- enabling only level 1 tiered compilation

then the result would be reduced lambda start up and execution times.

## Summary

Splitting out the ios (and android) sender lambdas reduced the size of the assembled code to ~80MB.

Given the reduced size of the assembled code we could revert to deploying the lambda with a zip archive (as opposed to a container image). This resulted in a 0.7s saving during cold starts.

Migrating to the latest version of the aws java sdk (v2) resulted in further 0.5s reduction in cold start time.

Compiling for java 11 and specifying java 11 (corretto) as the runtime for our experimental lambda resulted in no noticeable improvement in cold start or execution times.

The changes weren't simple to test and were done quickly and roughly to get a sense for any potential improvements.

Given the potential effort required to split up the existing scala lambda vs the potential gain we think the better option would be to migrate the lambdas to typescript (or another more performant language).

As a potential health task we could consider migrating our existing `notificationworkerlambda` to v2 of the aws sdk as this would more easily bring some (limited) performance improvements.

## Test Results

We experimented with the ios and android sender lambdas. We manually created a new lambda function for testing purposes only (once a lambda has been created using container images it cannot be converted to deploy using a .zip archive).

### Deploying Lambdas with a .zip archive

In order to deploy a lambda with a .zip archive we needed to reduce the artifact size.

The sender lambda code was extracted from the `notificationworkerlambda` module. Initially the experimental `workerlambda` relied on `common` but the resulting .zip exceeded the size limit accepted by lambda (250MB).


To eliminate the dependency on `common` some classes were copied across (along with some necessary dependencies) and the new project definition could be defined:

```scala
lazy val workerlambda = lambda("workerlambda", "workerlambda", None)
.settings(
libraryDependencies ++= Seq(
"com.typesafe.play" %% "play-json" % playJsonVersion,
"com.typesafe.play" %% "play-json-joda" % playJsonVersion,
"ai.x" %% "play-json-extensions" % "0.42.0",
"com.squareup.okhttp3" % "okhttp" % okHttpVersion,
"com.amazonaws" % "aws-lambda-java-events" % "2.2.8",
"com.amazonaws" % "aws-java-sdk-cloudwatch" % awsSdkVersion,
"com.google.firebase" % "firebase-admin" % "8.1.0",
"com.turo" % "pushy" % "0.13.10",
"software.amazon.awssdk" % "sqs" % "2.17.239",
"org.typelevel" %% "cats-effect" % "2.5.1",
"co.fs2" %% "fs2-core" % "2.5.6",
),
excludeDependencies ++= Seq(
ExclusionRule("com.typesafe.play", "play-ahc-ws_2.13")
),
)
```

NB: the `lambda` project has only minimal dependencies for logging, aws-lambda-java-core and aws simple configuration.

The resulting .jar from this project was 81.4MB and was uploaded to s3. It was then specified as a source for our manually created lambda.

We created a test event to successfully trigger the lambda. The results below show the difference in timings:

| |Init Duration (ms)|Duration (ms)|
|:----|:----|:----|
|ECR|4777|2893|
|ZIP|4015|2665|

We can see a marginal improvement, but perhaps not worth the engineering effort to achieve this.

### Migrating to AWS SDK v2

We migrated the experimental sender lambda to use version 2 of the java aws sdk as it was [indicated](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-improve-java-function-performance/) that the latest library would bring about performance improvements.

After migrating to version 2 (where we could, because no new version existed for `aws-lambda-java-events`) this was the resulting project definition:

```scala
lazy val workerlambda = lambda("workerlambda", "workerlambda", None)
.settings(
libraryDependencies ++= Seq(
"com.typesafe.play" %% "play-json" % playJsonVersion,
"com.typesafe.play" %% "play-json-joda" % playJsonVersion,
"ai.x" %% "play-json-extensions" % "0.42.0",
"com.squareup.okhttp3" % "okhttp" % okHttpVersion,
"software.amazon.awssdk" % "lambda" % "2.17.239",
"com.amazonaws" % "aws-lambda-java-events" % "2.2.8",
"software.amazon.awssdk" % "cloudwatch" % "2.17.239",
"com.google.firebase" % "firebase-admin" % "8.1.0",
"com.turo" % "pushy" % "0.13.10",
"software.amazon.awssdk" % "sqs" % "2.17.239",
"org.typelevel" %% "cats-effect" % "2.5.1",
"co.fs2" %% "fs2-core" % "2.5.6",
),
excludeDependencies ++= Seq(
ExclusionRule("com.typesafe.play", "play-ahc-ws_2.13")
)
)
```

The change slightly increased the artifact size (to 83.3MB) but it could still be specified as the source for our lambda. The results below show the difference in timings:

| |Init Duration (ms)|Duration (ms)|
|:----|:----|:----|
|AWS SDK v1|4015|2665|
|AWS SDK v2|3575|2488|

It might be a good health task to migrate our `notificationworkerlambda` to the aws sdk v2 (without also breaking out the individual lambdas) as we may get a small performance benefit as a result.

### Upgrading to Java 11

The build.sbt for mobile-n10n does not define any jvm options. We set java 11 on our local machines and compiled + assembled the experimental lambda, as well as updating the runtime defined for the function in the aws console.

When testing we did not observe any noticeable difference in the cold start or execution time.

### Restricting Tiered Compilation Setting

There is interesting [information](https://aws.amazon.com/blogs/compute/increasing-performance-of-java-aws-lambda-functions-using-tiered-compilation/) about how to reduce the cold start times of jvm-based lambdas. The suggestion is to use tiered compilation to reduce the cold start time of the lambda.

By only allowing 'level 1' compilation (no profiling) it reduces the amount of profiling and therefore the time before code can be executed.

To implement this we needed a lambda 'layer' to define this setting. We couldn't apply a 'layer' because of size restrictions on lambdas (the size restriction applied to the unzipped package size).

As a result, we couldn't easily test whether this setting had a measurable impact on cold start time.