Reapply "Add valkey cache handler to next app (#1210)" (#1215) #1220

Merged
kalilsn merged 16 commits into main from kalilsn/caching-again
May 6, 2025

kalilsn (Contributor) commented Apr 30, 2025

Issue(s) Resolved

Resolves #1131

High-level Explanation of PR

This PR changes core to use valkey as its cache, as opposed to the filesystem. Theoretically it should be automatically instrumented by sentry since we're using a supported redis client.

I haven't yet set a max size, because I'm really not sure what it should be. IMO we shouldn't really worry about that yet, since the large values are a problem for the node server which can't gracefully run out of memory (which we should fix), whereas valkey will simply evict keys. Once we have some data about how the cache is performing, we can tune that configuration more.

Test Plan

  • Start the app without valkey running. Ensure that it still serves requests.
  • Start the app with valkey running. In a separate terminal, run redis-cli monitor. Send a few requests and verify that you see valkey being used. Stop valkey (pnpm -w dev:cache:stop). Confirm the app still serves requests.
  • not yet working without a manual restart of the app: After either of the above scenarios, when valkey is stopped, restart valkey without restarting the app. New requests should use the valkey cache.

Comment thread core/cache-handler.mjs Outdated
@kalilsn kalilsn requested review from 3mcd and tefkah and removed request for tefkah April 30, 2025 21:40
@tefkah (Member) left a comment

if it would be possible to use a fallback i think that would be ideal to decrease possible downtime/manual fiddling! ideal behavior to me in case it can't reach valkey:

  • fallback to file system caching for now (or no caching, that's also fine)
  • try to reconnect in the meantime
  • if connection established, start using valkey cache
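
The fallback behavior listed above could look roughly like the sketch below. It is not the code in this PR: `withFallback`, `primary`, `fallback`, and `noCache` are hypothetical names, and both handlers are assumed to expose async `get(key)` and `set(key, value)` methods.

```javascript
// Hypothetical sketch: wrap a primary cache handler (e.g. valkey) so that any
// failure falls through to a fallback handler (file system, or a no-op).
function withFallback(primary, fallback) {
  const attempt = async (method, args) => {
    try {
      return await primary[method](...args);
    } catch (err) {
      // The primary is unreachable or timed out: log it and use the fallback.
      console.warn(`cache ${method} failed, using fallback:`, err.message);
      return fallback[method](...args);
    }
  };
  return {
    get: (key) => attempt("get", [key]),
    set: (key, value) => attempt("set", [key, value]),
  };
}

// A "no caching" fallback: every read is a miss and writes are dropped.
const noCache = {
  get: async () => null,
  set: async () => {},
};
```

Because every call tries the primary first, requests would start using valkey again as soon as the client has reconnected, without restarting the app.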

Comment thread core/cache-handler.mjs Outdated
tefkah previously requested changes May 1, 2025
@tefkah (Member) left a comment

whoops, i meant to request changes! if the behavior described above turns out not to be possible, i think at least creating a sentry issue would be nice (maybe it already does that if you throw)

@kalilsn kalilsn dismissed tefkah’s stale review May 5, 2025 23:07

problems fixed!

Comment thread core/cache-handler.mjs
Comment on lines +174 to +184
redisClient = new Redis({
  host: process.env.VALKEY_HOST,
  lazyConnect: true,
  commandTimeout: 1000,
  retryStrategy: (times) => {
    console.log("Retrying redis connection attempt:", times);
    // 2 ** times is exponential; 2 ^ times would be a bitwise XOR
    return 2 ** times + Math.random() * 1000;
  },
});

await redisClient.connect();
Contributor Author

I'd love to reuse the singleton client from lib/redis but I can't seem to import it in this file (lib/redis.ts gets inlined/chunked somewhere, and the build process probably doesn't look at imports in this file). So we have to repeat some of the connection logic + params, and we have two clients per instance, but at least it's not more!
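
On the `retryStrategy` in the snippet above: ioredis calls it with the attempt count and waits the returned number of milliseconds before reconnecting. A minimal sketch of exponential backoff with jitter follows. `backoffWithJitter`, `baseMs`, and `capMs` are illustrative assumptions, not settings from this PR.

```javascript
// Hypothetical helper: delay in ms before reconnect attempt number `times`.
// `random` is injectable so the result can be checked deterministically.
function backoffWithJitter(times, { baseMs = 50, capMs = 10000, random = Math.random } = {}) {
  // Exponential growth, capped so late attempts do not wait indefinitely.
  const exponential = Math.min(capMs, baseMs * 2 ** times);
  // Up to one second of jitter so many clients do not reconnect in lockstep.
  return Math.floor(exponential + random() * 1000);
}
```

It could be passed to the client as `retryStrategy: (times) => backoffWithJitter(times)`.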

Comment thread core/cache-handler.mjs
/**
* Solution taken from here: https://github.com/vercel/next.js/discussions/48324#discussioncomment-10542097
* Creates a redis handler based on fortedigital's redis-strings handler, but using the ioredis
* client
Contributor Author

the node-redis client does not handle reconnecting well at all and it seems like many people have reported this issue for many years, with no fix. so I switched to ioredis, which both handles reconnection properly and has much clearer documentation. unfortunately @neshca/cache-handler and the @fortedigital handlers are tightly coupled to node-redis, so I've inlined a compatible version of the redis-strings handler here. the main changes were dropping the client.isReady check (which doesn't do anything), removing the per command timeouts (because that's set on the client), and replacing the method calls with lowercase versions to match the ioredis api. see 4346195 for the exact diff from the fortedigital handler

Contributor Author

i spoke too soon. there were more differences (84543b3) and i also realized there was a use to the ready check (d603548)
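
One way a ready check can be useful with ioredis: the client exposes a `status` string that is `"ready"` only while it can send commands. The sketch below treats any other state as a cache miss rather than issuing a command that may queue or time out. `getIfReady` is a hypothetical helper, not the code from the commits above, and it assumes values are stored as JSON strings.

```javascript
// Hypothetical helper: read a JSON value only when the client can take commands.
// Assumes an ioredis-like client with a `status` property and an async `get`.
async function getIfReady(client, key) {
  if (client.status !== "ready") {
    return null; // treat a disconnected cache as a miss
  }
  const raw = await client.get(key);
  return raw === null ? null : JSON.parse(raw);
}
```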

Comment thread Dockerfile
  COPY --from=withpackage --chown=node:node /usr/src/app/core/.env.docker ./core/.env

- CMD node core/server.js
+ CMD ["node", "core/server.js"]
Contributor Author

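Apparently using the shell form (non-JSON arguments) of CMD here can prevent the server from responding to signals sent to the container, since the command then runs under a shell rather than as the container's main process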

kalilsn (Contributor, Author) commented May 6, 2025

This works much better after switching to ioredis. The client will attempt to reconnect automatically, and it will simply fall back to uncached behavior when the cache is unreachable. Unfortunately I couldn't get Sentry.captureException working inside the cache-handler.mjs file so there's no explicit alert. But the health check now checks redis for connectivity and sentry auto-instrumentation should work there.
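
The health-check change itself is not shown in this conversation. As a rough illustration only, a connectivity probe might ping the cache with a short timeout and report the result instead of throwing. `cacheHealth` and the 500 ms default are assumptions; `ping()` is the standard command method on an ioredis client.

```javascript
// Hypothetical health probe: resolves to "ok" or "unreachable", never throws.
async function cacheHealth(client, timeoutMs = 500) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("ping timed out")), timeoutMs);
  });
  try {
    // Whichever settles first wins: the PING reply or the timeout.
    const reply = await Promise.race([client.ping(), timeout]);
    return reply === "PONG" ? "ok" : "unreachable";
  } catch {
    return "unreachable";
  } finally {
    clearTimeout(timer);
  }
}
```

A health route could call this and include the result in its response, so a failing cache is visible without taking the app down.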

Terraform plan for env var change, health check, and restart policy
Terraform will perform the following actions:

  # module.deployment.module.service_core.aws_lb_target_group.this[0] will be updated in-place
  ~ resource "aws_lb_target_group" "this" {
        id                                 = "arn:aws:elasticloadbalancing:us-east-1:246372085946:targetgroup/stevie-94a0/1a912e6be01e0401"
        name                               = "stevie-94a0"
        tags                               = {}
        # (17 unchanged attributes hidden)

      ~ health_check {
          ~ healthy_threshold   = 5 -> 3
          ~ interval            = 30 -> 5
          ~ path                = "/legacy_healthcheck" -> "/api/health"
            # (6 unchanged attributes hidden)
        }

        # (4 unchanged blocks hidden)
    }

  # module.deployment.module.service_bastion.module.ecs_service.data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                      = (known after apply)
      + arn_without_revision     = (known after apply)
      + container_definitions    = (known after apply)
      + cpu                      = (known after apply)
      + enable_fault_injection   = (known after apply)
      + ephemeral_storage        = (known after apply)
      + execution_role_arn       = (known after apply)
      + family                   = (known after apply)
      + id                       = (known after apply)
      + inference_accelerator    = (known after apply)
      + ipc_mode                 = (known after apply)
      + memory                   = (known after apply)
      + network_mode             = (known after apply)
      + pid_mode                 = (known after apply)
      + placement_constraints    = (known after apply)
      + proxy_configuration      = (known after apply)
      + requires_compatibilities = (known after apply)
      + revision                 = (known after apply)
      + runtime_platform         = (known after apply)
      + status                   = (known after apply)
      + task_definition          = "stevie-bastion"
      + task_role_arn            = (known after apply)
      + volume                   = (known after apply)
    }

  # module.deployment.module.service_bastion.module.ecs_service.aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-bastion:392" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-bastion" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ environment            = [
                        # (7 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "bastion"
                    # (17 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "stevie-bastion" -> (known after apply)
      ~ revision                 = 392 -> (known after apply)
        tags                     = {
            "Environment"         = "stevie-production"
            "LogicalName"         = "bastion"
            "Project"             = "Pubpub-v7"
            "Shortname"           = "590b"
            "ShortnameAnnotation" = "Shortname is calculated as first four characters of the sha1sum of the Logical Name."
        }
        # (10 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

  # module.deployment.module.service_core.module.ecs_service.data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                      = (known after apply)
      + arn_without_revision     = (known after apply)
      + container_definitions    = (known after apply)
      + cpu                      = (known after apply)
      + enable_fault_injection   = (known after apply)
      + ephemeral_storage        = (known after apply)
      + execution_role_arn       = (known after apply)
      + family                   = (known after apply)
      + id                       = (known after apply)
      + inference_accelerator    = (known after apply)
      + ipc_mode                 = (known after apply)
      + memory                   = (known after apply)
      + network_mode             = (known after apply)
      + pid_mode                 = (known after apply)
      + placement_constraints    = (known after apply)
      + proxy_configuration      = (known after apply)
      + requires_compatibilities = (known after apply)
      + revision                 = (known after apply)
      + runtime_platform         = (known after apply)
      + status                   = (known after apply)
      + task_definition          = "stevie-core"
      + task_role_arn            = (known after apply)
      + volume                   = (known after apply)
    }

  # module.deployment.module.service_core.module.ecs_service.aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-core:393" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:246372085946:task-definition/stevie-core" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ environment            = [
                        # (18 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "core"
                    # (17 unchanged attributes hidden)
                },
              ~ {
                  ~ environment            = [
                        # (18 unchanged elements hidden)
                        {
                            name  = "SUPABASE_URL"
                            value = "https://dsleqjuvzuoycpeotdws.supabase.co"
                        },
                      ~ {
                          ~ name  = "VALKEY_URL" -> "VALKEY_HOST"
                          ~ value = "redis://stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com" -> "stevie-core-valkey-production.we8a07.ng.0001.use1.cache.amazonaws.com"
                        },
                    ]
                    name                   = "migrations"
                    # (17 unchanged attributes hidden)
                },
                {
                    environment            = [
                        {
                            name  = "NGINX_LISTEN_PORT"
                            value = "8080"
                        },
                        {
                            name  = "NGINX_PREFIX"
                            value = "/"
                        },
                        {
                            name  = "NGINX_UPSTREAM_HOST"
                            value = "127.0.0.1"
                        },
                        {
                            name  = "NGINX_UPSTREAM_PORT"
                            value = "3000"
                        },
                        {
                            name  = "OTEL_SERVICE_NAME"
                            value = "core.nginx"
                        },
                    ]
                    essential              = true
                    image                  = "246372085946.dkr.ecr.us-east-1.amazonaws.com/nginx:latest"
                    interactive            = false
                    linuxParameters        = {
                        initProcessEnabled = true
                    }
                    logConfiguration       = {
                        logDriver = "awslogs"
                        options   = {
                            awslogs-group         = "stevie-ecs-production-container-logs"
                            awslogs-region        = "us-east-1"
                            awslogs-stream-prefix = "ecs"
                        }
                    }
                    mountPoints            = []
                    name                   = "nginx"
                    portMappings           = [
                        {
                            containerPort = 8080
                            hostPort      = 8080
                            name          = "core-nginx"
                            protocol      = "tcp"
                        },
                    ]
                    privileged             = false
                    pseudoTerminal         = false
                    readonlyRootFilesystem = false
                    startTimeout           = 30
                    stopTimeout            = 120
                    systemControls         = []
                    user                   = "0"
                    volumesFrom            = []
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "stevie-core" -> (known after apply)
      ~ revision                 = 393 -> (known after apply)
        tags                     = {
            "Environment"         = "stevie-production"
            "LogicalName"         = "core"
            "Project"             = "Pubpub-v7"
            "Shortname"           = "94a0"
            "ShortnameAnnotation" = "Shortname is calculated as first four characters of the sha1sum of the Logical Name."
        }
        # (10 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 2 to add, 1 to change, 2 to destroy.

kalilsn (Contributor, Author) commented May 6, 2025

Changes applied after i lowered the health check timeout because of this error:

╷
│ Error: modifying ELBv2 Target Group (arn:aws:elasticloadbalancing:us-east-1:246372085946:targetgroup/stevie-94a0/1a912e6be01e0401): operation error Elastic Load Balancing v2: ModifyTargetGroup, https response error StatusCode: 400, RequestID: fdeecfb8-6bf6-4d01-998c-446a1c2f88b0, api error ValidationError: Health check interval must be greater than the timeout.
│
│   with module.deployment.module.service_core.aws_lb_target_group.this[0],
│   on ../../modules/container-generic/main.tf line 152, in resource "aws_lb_target_group" "this":
│  152: resource "aws_lb_target_group" "this" {
│
╵

@kalilsn kalilsn merged commit 1c21713 into main May 6, 2025
12 checks passed
@kalilsn kalilsn deleted the kalilsn/caching-again branch May 6, 2025 13:17

Development

Successfully merging this pull request may close these issues.

Autocache using redis

3 participants