Reapply "Add valkey cache handler to next app (#1210)" (#1215) by kalilsn · Pull Request #1220 · knowledgefutures/pubplatform

kalilsn · 2025-04-30T21:16:40Z

Issue(s) Resolved

Resolves #1131

High-level Explanation of PR

This PR changes core to use valkey as its cache, as opposed to the filesystem. Theoretically it should be automatically instrumented by sentry since we're using a supported redis client.

I haven't yet set a max size, because I'm really not sure what it should be. IMO we shouldn't really worry about that yet, since the large values are a problem for the node server which can't gracefully run out of memory (which we should fix), whereas valkey will simply evict keys. Once we have some data about how the cache is performing, we can tune that configuration more.

Test Plan

Start the app without valkey running. Ensure that it still serves requests.
Start the app with valkey running. In a separate terminal, run redis-cli monitor. Send a few requests and verify that you see valkey being used. Stop valkey (pnpm -w dev:cache:stop). Confirm the app still serves requests.
Not yet working without a manual restart of the app: after either of the above scenarios, while valkey is stopped, start valkey again without restarting the app. New requests should use the valkey cache.

This reverts commit 0951e72.

tefkah

if it would be possible to use a fallback i think that would be ideal to decrease possible downtime/manual fiddling! ideal behavior to me in case it can't reach valkey:

fallback to file system caching for now (or no caching, that's also fine)
try to reconnect in the meantime
if connection established, start using valkey cache
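The behavior requested above could be sketched roughly as follows. This is only an illustration under assumptions, not the handler from this PR: `CacheClient` and `getWithFallback` are hypothetical names, and the `status` field stands in for the connection-state string that ioredis exposes on its client. Reconnection itself is left to the client library; the wrapper only decides whether to use the cache for a given request.

```typescript
// Minimal stand-in for the part of a redis client this sketch needs.
// `status` mirrors the connection-state string exposed by ioredis.
interface CacheClient {
  status: string;
  get(key: string): Promise<string | null>;
}

// Returns the cached value, or null (treated as a cache miss) whenever the
// client is not connected or the command fails, so requests are still served.
async function getWithFallback(
  client: CacheClient,
  key: string,
): Promise<string | null> {
  if (client.status !== "ready") {
    // Skip the cache right away while the client is (re)connecting.
    return null;
  }
  try {
    return await client.get(key);
  } catch (err) {
    console.error("cache get failed, serving uncached:", err);
    return null;
  }
}
```

Once the client reports `ready` again, the same call starts returning cached values without an app restart.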

tefkah

whoops, i meant to request changes! if the behavior described above turns out not to be possible, i think at least creating a sentry issue would be nice (maybe it already does that if you throw)

problems fixed!

kalilsn · 2025-05-05T23:10:55Z

+			redisClient = new Redis({
+				host: process.env.VALKEY_HOST,
+				lazyConnect: true,
+				commandTimeout: 1000,
+				retryStrategy: (times) => {
+					console.log("Retrying redis connection attempt:", times);
+					return (2 ^ times) + Math.random() * 1000;
+				},
+			});
+
+			await redisClient.connect();


I'd love to reuse the singleton client from lib/redis but I can't seem to import it in this file (lib/redis.ts gets inlined/chunked somewhere, and the build process probably doesn't look at imports in this file). So we have to repeat some of the connection logic + params, and we have two clients per instance, but at least it's not more!
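A note on the `retryStrategy` in the hunk above: in JavaScript and TypeScript `^` is bitwise XOR, not exponentiation, so `2 ^ times` does not grow with the attempt count (for example, `2 ^ 3` evaluates to 1). A later commit in this PR is titled "Fix retry strategy calculation" (e234789); its contents are not shown in this thread. The following is only a sketch of what exponential backoff with jitter could look like. The function name `retryDelayMs`, the 100 ms base, and the 30 s cap are assumptions made for illustration.

```typescript
// Exponential backoff with up to one second of random jitter, capped at maxMs.
// `random` is injectable so the calculation can be checked deterministically.
function retryDelayMs(
  times: number,
  random: () => number = Math.random,
  baseMs = 100,
  maxMs = 30000,
): number {
  const exponential = baseMs * 2 ** times; // `**` is exponentiation, `^` is XOR
  const jitter = random() * 1000;
  return Math.min(exponential + jitter, maxMs);
}
```

A function like this could be passed as `retryStrategy: (times) => retryDelayMs(times)` when constructing the client.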

kalilsn · 2025-05-05T23:18:20Z

 /**
- * Solution taken from here: https://github.com/vercel/next.js/discussions/48324#discussioncomment-10542097
+ * Creates a redis handler based on fortedigital's redis-strings handler, but using the ioredis
+ * client


the node-redis client does not handle reconnecting well at all and it seems like many people have reported this issue for many years, with no fix. so I switched to ioredis, which both handles reconnection properly and has much clearer documentation. unfortunately @neshca/cache-handler and the @fortedigital handlers are tightly coupled to node-redis, so I've inlined a compatible version of the redis-strings handler here. the main changes were dropping the client.isReady check (which doesn't do anything), removing the per command timeouts (because that's set on the client), and replacing the method calls with lowercase versions to match the ioredis api. see 4346195 for the exact diff from the fortedigital handler

i spoke too soon. there were more differences (84543b3) and i also realized there was a use to the ready check (d603548)
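One of the differences referred to above is in the hash scan call (commit 84543b3 is titled "Fix bugs caused by differences in hscan api between redis clients"). As far as I understand the two clients, ioredis resolves `hscan` to a `[cursor, elements]` pair where `elements` is a flat array of alternating fields and values, while node-redis v4 resolves `hScan` to an object with `cursor` and `tuples`. The sketch below converts the former shape into the latter; `FieldValue`, `toTuples`, and `normalizeHscanReply` are hypothetical helper names, not code from this PR.

```typescript
type FieldValue = { field: string; value: string };

// Convert a flat [field1, value1, field2, value2, ...] array into pairs.
// A trailing element without a partner is ignored.
function toTuples(flat: string[]): FieldValue[] {
  const tuples: FieldValue[] = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    tuples.push({ field: flat[i], value: flat[i + 1] });
  }
  return tuples;
}

// Normalize a [cursor, flat elements] reply into a { cursor, tuples } object.
function normalizeHscanReply(
  reply: [string, string[]],
): { cursor: string; tuples: FieldValue[] } {
  const [cursor, flat] = reply;
  return { cursor, tuples: toTuples(flat) };
}
```

Code written against the object shape that is handed the array shape unchanged would find `cursor` and `tuples` undefined, which is the kind of bug a client swap can introduce silently.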

kalilsn · 2025-05-05T23:19:33Z

 COPY --from=withpackage --chown=node:node /usr/src/app/core/.env.docker ./core/.env

-CMD node core/server.js
+CMD ["node", "core/server.js"]


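Apparently using the shell form (non-JSON arguments) here can prevent the server from responding to signals sent to the container, since the command then runs under /bin/sh -c and the shell does not forward signals to node.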

kalilsn · 2025-05-06T04:13:49Z

This works much better after switching to ioredis. The client will attempt to reconnect automatically, and it will simply fall back to uncached behavior when the cache is unreachable. Unfortunately I couldn't get Sentry.captureException working inside the cache-handler.mjs file so there's no explicit alert. But the health check now checks redis for connectivity and sentry auto-instrumentation should work there.
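For illustration, a connectivity check of the kind described above might look like the sketch below. It assumes the health endpoint issues a PING and treats a slow or failed reply as "unreachable"; the actual query used in this PR (commit cad3af5) is not shown in this thread. `Pingable` and `checkCache` are hypothetical names.

```typescript
// The only client capability this sketch relies on.
interface Pingable {
  ping(): Promise<string>;
}

// Resolves to "ok" if the ping answers within timeoutMs, else "unreachable".
async function checkCache(
  client: Pingable,
  timeoutMs = 1000,
): Promise<"ok" | "unreachable"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("ping timed out")), timeoutMs);
  });
  try {
    await Promise.race([client.ping(), timeout]);
    return "ok";
  } catch {
    return "unreachable";
  } finally {
    // Always clear the timer so it neither fires late nor keeps the process alive.
    clearTimeout(timer);
  }
}
```

Whether an unreachable cache should fail the whole health check or only be reported is a separate design decision, since the app can still serve uncached requests.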

Terraform plan for env var change, health check, and restart policy

Terraform will perform the following actions:

  # module.deployment.module.service_core.aws_lb_target_group.this[0] will be updated in-place
  ~ resource "aws_lb_target_group" "this" {
        id                                 = "arn:aws:elasticloadbalancing:us-east-1:246372085946:targetgroup/stevie-94a0/1a912e6be01e0401"
        name                               = "stevie-94a0"
        tags                               = {}
        # (17 unchanged attributes hidden)

      ~ health_check {
          ~ healthy_threshold   = 5 -> 3
          ~ interval            = 30 -> 5
          ~ path                = "/legacy_healthcheck" -> "/api/health"
            # (6 unchanged attributes hidden)
        }

        # (4 unchanged blocks hidden)
    }

  # module.deployment.module.service_bastion.module.ecs_service.data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                      = (known after apply)
      + arn_without_revision     = (known after apply)
      + container_definitions    = (known after apply)
      + cpu                      = (known after apply)
      + enable_fault_injection   = (known after apply)
      + ephemeral_storage        = (known after apply)
      + execution_role_arn       = (known after apply)
      + family                   = (known after apply)
      + id                       = (known after apply)
      + inference_accelerator    = (known after apply)
      + ipc_mode                 = (known after apply)
      + memory                   = (known after apply)
      + network_mode             = (known after apply)
      + pid_mode                 = (known after apply)
      + placement_constraints    = (known after apply)
      + proxy_configuration      = (known after apply)
      + requires_compatibilities = (known after apply)
      + revision                 = (known after apply)
      + runtime_platform         = (known after apply)
      + status                   = (known after apply)
      + task_definition          = "stevie-bastion"
      + task_role_arn            = (known after apply)
      + volume                   = (known after apply)
    }

  # module.deployment.module.service_bastion.module.ecs_service.aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-bastion:392" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-bastion" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ environment            = [
                        # (7 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "bastion"
                    # (17 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "stevie-bastion" -> (known after apply)
      ~ revision                 = 392 -> (known after apply)
        tags                     = {
            "Environment"         = "stevie-production"
            "LogicalName"         = "bastion"
            "Project"             = "Pubpub-v7"
            "Shortname"           = "590b"
            "ShortnameAnnotation" = "Shortname is calculated as first four characters of the sha1sum of the Logical Name."
        }
        # (10 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

  # module.deployment.module.service_core.module.ecs_service.data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                      = (known after apply)
      + arn_without_revision     = (known after apply)
      + container_definitions    = (known after apply)
      + cpu                      = (known after apply)
      + enable_fault_injection   = (known after apply)
      + ephemeral_storage        = (known after apply)
      + execution_role_arn       = (known after apply)
      + family                   = (known after apply)
      + id                       = (known after apply)
      + inference_accelerator    = (known after apply)
      + ipc_mode                 = (known after apply)
      + memory                   = (known after apply)
      + network_mode             = (known after apply)
      + pid_mode                 = (known after apply)
      + placement_constraints    = (known after apply)
      + proxy_configuration      = (known after apply)
      + requires_compatibilities = (known after apply)
      + revision                 = (known after apply)
      + runtime_platform         = (known after apply)
      + status                   = (known after apply)
      + task_definition          = "stevie-core"
      + task_role_arn            = (known after apply)
      + volume                   = (known after apply)
    }

  # module.deployment.module.service_core.module.ecs_service.aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-core:393" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-core" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ environment            = [
                        # (18 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "core"
                    # (17 unchanged attributes hidden)
                },
              ~ {
                  ~ environment            = [
                        # (18 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "migrations"
                    # (17 unchanged attributes hidden)
                },
                {
                    environment            = [
                        {
                            name  = "NGINX_LISTEN_PORT"
                            value = "8080"
                        },
                        {
                            name  = "NGINX_PREFIX"
                            value = "/"
                        },
                        {
                            name  = "NGINX_UPSTREAM_HOST"
                            value = "127.0.0.1"
                        },
                        {
                            name  = "NGINX_UPSTREAM_PORT"
                            value = "3000"
                        },
                        {
                            name  = "OTEL_SERVICE_NAME"
                            value = "core.nginx"
                        },
                    ]
                    essential              = true
                    image                  = "246372085946.dkr.ecr.us-east-1.amazonaws.com/nginx:latest"
                    interactive            = false
                    linuxParameters        = {
                        initProcessEnabled = true
                    }
                    logConfiguration       = {
                        logDriver = "awslogs"
                        options   = {
                            awslogs-group         = "stevie-ecs-production-container-logs"
                            awslogs-region        = "us-east-1"
                            awslogs-stream-prefix = "ecs"
                        }
                    }
                    mountPoints            = []
                    name                   = "nginx"
                    portMappings           = [
                        {
                            containerPort = 8080
                            hostPort      = 8080
                            name          = "core-nginx"
                            protocol      = "tcp"
                        },
                    ]
                    privileged             = false
                    pseudoTerminal         = false
                    readonlyRootFilesystem = false
                    startTimeout           = 30
                    stopTimeout            = 120
                    systemControls         = []
                    user                   = "0"
                    volumesFrom            = []
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "stevie-core" -> (known after apply)
      ~ revision                 = 393 -> (known after apply)
        tags                     = {
            "Environment"         = "stevie-production"
            "LogicalName"         = "core"
            "Project"             = "Pubpub-v7"
            "Shortname"           = "94a0"
            "ShortnameAnnotation" = "Shortname is calculated as first four characters of the sha1sum of the Logical Name."
        }
        # (10 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 2 to add, 1 to change, 2 to destroy.

kalilsn · 2025-05-06T13:17:11Z

Changes applied after i lowered the health check timeout because of this error:

╷
│ Error: modifying ELBv2 Target Group (arn:aws:elasticloadbalancing:us-east-1:246372085946:targetgroup/stevie-94a0/1a912e6be01e0401): operation error Elastic Load Balancing v2: ModifyTargetGroup, https response error StatusCode: 400, RequestID: fdeecfb8-6bf6-4d01-998c-446a1c2f88b0, api error ValidationError: Health check interval must be greater than the timeout.
│
│   with module.deployment.module.service_core.aws_lb_target_group.this[0],
│   on ../../modules/container-generic/main.tf line 152, in resource "aws_lb_target_group" "this":
│  152: resource "aws_lb_target_group" "this" {
│
╵
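The constraint behind this error is that the health check interval must be strictly greater than the timeout. The plan above lowers the interval from 30 to 5 seconds, while the timeout is among the unchanged attributes hidden in the plan output, so its value is not visible here. A check like the following could catch the mismatch before applying; `HealthCheckSettings` and `validateHealthCheck` are hypothetical names used only for this sketch.

```typescript
interface HealthCheckSettings {
  intervalSeconds: number;
  timeoutSeconds: number;
}

// Returns a list of problems; an empty list means the settings pass this check.
function validateHealthCheck(hc: HealthCheckSettings): string[] {
  const errors: string[] = [];
  if (hc.intervalSeconds <= hc.timeoutSeconds) {
    errors.push("Health check interval must be greater than the timeout.");
  }
  return errors;
}
```

For example, an interval of 5 with a timeout of 5 is rejected, while an interval of 5 with a timeout of 3 passes.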

kalilsn added 2 commits April 30, 2025 14:10

Reapply "Add valkey cache handler to next app (#1210)" (#1215)

f2986b4

This reverts commit 0951e72.

Disable cache when unable to connect to redis

e169a24

kalilsn commented Apr 30, 2025

Comment thread core/cache-handler.mjs Outdated

kalilsn requested review from 3mcd and tefkah and removed request for tefkah April 30, 2025 21:40

tefkah approved these changes May 1, 2025

Comment thread core/cache-handler.mjs Outdated

tefkah previously requested changes May 1, 2025

kalilsn added 6 commits May 5, 2025 13:24

Cleanup dockerfile

5dc5946

Capture sentry exception when redis errors

536080f

Add redis query to healthcheck endpoint

cad3af5

Inline forte-digital cache handler

8304abc

Switch to ioredis

4346195

Merge branch 'main' into kalilsn/caching-again

d9aa8d5

kalilsn commented May 5, 2025

kalilsn added 5 commits May 5, 2025 22:36

Fix bugs caused by differences in hscan api between redis clients

84543b3

Fix retry strategy calculation

e234789

Add error logging to cache functions

d289b0a

Make sure fallback behavior is instant when client is reconnecting

d603548

Also fix retry logic for lib/redis client

ec1fffc

kalilsn added 2 commits May 5, 2025 23:36

Add timestamps to logs output in ci

921b6f7

Use proper healthcheck and set restart policy

8577e11

3mcd approved these changes May 6, 2025

Lower health check timeout

ebb20d4

kalilsn merged commit 1c21713 into main May 6, 2025
12 checks passed

kalilsn deleted the kalilsn/caching-again branch May 6, 2025 13:17

kalilsn mentioned this pull request May 6, 2025

Detect and handle (or prevent) deadlocks #1071

Closed

kalilsn merged 16 commits into main from kalilsn/caching-again
