release: promote OTP and prod deploy guards by rhanka · Pull Request #65 · matchID-project/matchID

rhanka · 2026-05-23T12:36:03Z

Summary

promote dev to master for the OTP backend release
include embedded graphify artifacts already merged in dev
include the prod deploy guards from "fix(cd): preserve prod server on deploy" (#64), so that a manual prod dispatch preserves the existing prod server and refuses to create a third preserved prod server

Release safety

merge commit must include [skip ci] to avoid automatic CD/dataprep-full on push master
prod deploy will be triggered separately with dataprep_scope=none and deploy_target=prod

Checks

dev PR checks are expected to run on this release PR before merge

Adds a `deploy/k8s/` tree to drive matchID on a local k3d cluster and on the Scaleway Kapsule `poc` cluster (rhanka/poc-k8s). Not wired into CI/CD yet; meant to be applied by hand for the burst-mode test sessions described in the matchID onboarding intake against poc-k8s.

Files added:
- deploy/k8s/README.md: local k3d + poc cluster flows, known gaps
- deploy/k8s/Makefile: k3d-up/down, apply-local/poc, port-forward, logs, status
- deploy/k8s/base/: kustomize base
  * namespace.yaml: matchid Namespace (skipped in poc overlay)
  * deces-backend.deployment + service.yaml: matchid/deces-backend:latest, 8080
  * deces-ui.deployment + service.yaml: matchid/deces-ui:latest, 8083
  * elasticsearch.statefulset.yaml: ES 7.17.28, dev profile, JVM -Xmx512m
  * ingress.yaml: Traefik IngressRoute (deces.local)
  * kustomization.yaml
- deploy/k8s/local/kustomization.yaml: alias overlay pointing at overlays/local/
- deploy/k8s/overlays/local/: k3d / k3s-local
  * NodePort services (UI 30083, backend 30080)
  * hostPath PV for ES (StorageClass matchid-local)
  * drops the privileged sysctl init container (k3s ships with vm.max_map_count high enough)
- deploy/k8s/overlays/poc/: Scaleway Kapsule poc
  * nodeSelector pool=burst + toleration on every workload
  * replicas: 0 at rest on all three workloads (burst-mode tenant)
  * scw-bssd PVC for ES, IngressRoute on matchid-poc.matchid.io with cert-manager TLS
  * deletes the base Namespace (poc-k8s owns it under tenants/matchid/)

Resource sizing matches the poc-k8s intake (PR request/matchid-onboarding):
  deces-backend   100m / 500m  + 256Mi / 512Mi
  deces-ui         50m / 200m  +  64Mi / 128Mi
  elasticsearch   250m / 1500m + 512Mi / 1Gi

Validation:
  kubectl apply --dry-run=client --validate=false -k overlays/local/   OK
  kubectl apply --dry-run=client --validate=false -k overlays/poc/     OK

Caveats / not yet wired (documented in deploy/k8s/README.md):
- ES version drift: repo Makefiles pin 8.6.1 today; we ship 7.17.28 here to stay under 1 GiB heap on the poc cluster. Reconciliation is part of the surch swap follow-up.
- Surch swap: long-term plan is to drop the ES StatefulSet and point deces-backend at the surch tenant's surch-api Service (blocked on the DSL inventory in EXPERIMENT_SURCH.md).
- cert-manager / letsencrypt-prod ClusterIssuer is referenced but provisioned out-of-band by poc-k8s.
- OIDC auth + SMTP secrets are envFrom: secretRef with optional: true; the Secret itself is provisioned out-of-tree.
- No .github/workflows/k8s-*.yml yet; CI/CD wiring is a follow-up referenced in the poc-k8s intake (request/matchid-onboarding).
- deces-dataprep (INSEE ingest Job) not manifested yet; read-path lands first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
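As a rough illustration of the sizing figures in this commit message, the sketch below maps them onto the `resources` shape of a Kubernetes container spec and totals the requests. Reading each pair as "request / limit" is an assumption, and the helper names (`toResources`, `totalRequests`) are hypothetical, not part of the repository.

```typescript
// Sizing from the commit message, read as
// "CPU request / CPU limit + memory request / memory limit" (an assumption).
type Sizing = { cpu: [string, string]; memory: [string, string] };

const sizing: Record<string, Sizing> = {
  "deces-backend": { cpu: ["100m", "500m"], memory: ["256Mi", "512Mi"] },
  "deces-ui": { cpu: ["50m", "200m"], memory: ["64Mi", "128Mi"] },
  elasticsearch: { cpu: ["250m", "1500m"], memory: ["512Mi", "1Gi"] },
};

// Shape of the `resources` block in a Kubernetes container spec.
function toResources(s: Sizing) {
  return {
    requests: { cpu: s.cpu[0], memory: s.memory[0] },
    limits: { cpu: s.cpu[1], memory: s.memory[1] },
  };
}

// Convert "250m" to millicores; a bare number is treated as whole cores.
function millicores(cpu: string): number {
  return cpu.endsWith("m") ? Number(cpu.slice(0, -1)) : Number(cpu) * 1000;
}

// Convert "512Mi" / "1Gi" to MiB; other units are rejected in this sketch.
function mebibytes(mem: string): number {
  if (mem.endsWith("Gi")) return Number(mem.slice(0, -2)) * 1024;
  if (mem.endsWith("Mi")) return Number(mem.slice(0, -2));
  throw new Error(`unsupported memory unit: ${mem}`);
}

// Sum of requests across the three workloads when they are scaled up.
function totalRequests() {
  const all = Object.values(sizing);
  return {
    cpuMillicores: all.reduce((sum, s) => sum + millicores(s.cpu[0]), 0),
    memoryMiB: all.reduce((sum, s) => sum + mebibytes(s.memory[0]), 0),
  };
}
```

Under that reading, one replica of each workload requests 400m CPU and 832Mi of memory on the burst pool; at rest (replicas: 0) nothing is requested.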

…veat

- Add "Environment tiers" section: CI=k3s/k3d, Dev=Kapsule poc, Prod=TBD
- Document local prereqs: Docker + kubectl + k3d + ≥15% free disk on /, with the diagnostic+fix when kubelet's DiskPressure taint hits.

Caught during a smoke run today: laptop at 99% on / put every Pod in Pending with FailedScheduling pointing at DiskPressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e smoke

Adds .github/workflows/k8s-smoke.yml driving two paths:
- smoke-local (auto on push/PR with deploy/k8s/** changes): installs k3d inside the ubuntu-latest runner, brings up a single-node k3s cluster, applies overlays/local, waits for Traefik CRDs + workload availability, curls the deces-backend healthcheck + UI through NodePort. Tears down at the end regardless of outcome.
- smoke-poc (workflow_dispatch only): pulls a kubeconfig for the Scaleway Kapsule `poc` cluster via the SCW CLI, applies overlays/poc, waits for availability, curls the IngressRoute (Host: matchid-poc.matchid.io). Falls back to port-forward if the Traefik LB IP isn't ready yet.

Path-filter on deploy/k8s/** + workflow file to keep cost low. New secrets needed (header comment in the workflow lists them); the smoke-local path is the CI gate for this experimental branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comprehensive audit of architectural patterns breaking under Kubernetes:
- 8 P0 findings (in-memory state, file persistence, ES single-node)
- 6 P1 findings (sticky sessions, job timeouts, worker isolation)
- 4 P2 findings (logging, encryption, user DB)

Most critical:
1. OTP store in memory (mail.ts:60) - loses all OTPs on pod restart
2. IP-rate-limit maps (auth.ts:4-5) - routing to different pods bypasses bans
3. Job state arrays (processStream.ts:57-60) - lost on restart, breaks bulk

Effort: 6-8 weeks to K8s-ready (P0: 2w, P1: 2w, P2: 1w).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
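To make the second finding concrete, here is a minimal sketch of why per-process rate-limit maps stop enforcing bans once requests are spread across pods. It does not reproduce auth.ts; the class, the round-robin routing and the threshold are all hypothetical.

```typescript
// Each pod keeps its own in-process failure counter, as a module-level Map would.
class PodRateLimiter {
  private readonly failures = new Map<string, number>();
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  // Records one failed attempt for this IP on this pod only.
  recordFailure(ip: string): void {
    this.failures.set(ip, (this.failures.get(ip) ?? 0) + 1);
  }

  // An IP is banned on this pod once its local count reaches the threshold.
  isBanned(ip: string): boolean {
    return (this.failures.get(ip) ?? 0) >= this.maxFailures;
  }
}

// Spreads `attempts` failed requests round-robin over `podCount` pods and
// reports whether any pod ends up banning the client.
function anyPodBans(podCount: number, attempts: number, maxFailures: number): boolean {
  const pods = Array.from({ length: podCount }, () => new PodRateLimiter(maxFailures));
  const ip = "203.0.113.7"; // documentation-range address
  for (let i = 0; i < attempts; i++) {
    const pod = pods[i % podCount];
    if (pod) pod.recordFailure(ip);
  }
  return pods.some((pod) => pod.isBanned(ip));
}
```

With a threshold of 3, four failed attempts trip the ban on a single pod (`anyPodBans(1, 4, 3)` is true) but not when they alternate between two pods (`anyPodBans(2, 4, 3)` is false), since each pod only sees two of them.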

…n diags on cancel

The first runs (#25974574306, #25974575275) cancelled at the 15min job cap with no diagnostics because `if: failure()` doesn't fire on cancellation.

Patched workflow:
- Split "Wait for deployments" into ES-first then backend/UI so a slow ES doesn't eat the budget meant for backend.
- Background poller emits `kubectl get pods -o wide` + events every 30s during the wait, so the run logs always show why a pod is unhappy even when the parent step times out.
- Diagnostics now triggers on `failure() || cancelled()` so we capture state when the runner reaps the job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
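The two ideas in this patch can be sketched outside of workflow YAML. The snippet below models the step condition as a predicate over the job outcome and shows a generic "poll while waiting" helper. It is an illustration only: the workflow itself uses GitHub Actions expressions and shell, and `withPoller` is a hypothetical name.

```typescript
type Outcome = "success" | "failure" | "cancelled";

// Original condition: `if: failure()` does not match a cancelled job.
const runsOnFailureOnly = (outcome: Outcome): boolean => outcome === "failure";

// Patched condition: `if: failure() || cancelled()`.
const runsOnFailureOrCancel = (outcome: Outcome): boolean =>
  outcome === "failure" || outcome === "cancelled";

// Runs `task` while calling `poll` every `intervalMs`. The timer is cleared
// whether the task resolves or rejects, so the poller never outlives the wait.
async function withPoller<T>(
  task: () => Promise<T>,
  poll: () => void,
  intervalMs: number,
): Promise<T> {
  const timer = setInterval(poll, intervalMs);
  try {
    return await task();
  } finally {
    clearInterval(timer);
  }
}
```

Because the poller writes to the log as it goes, whatever it printed before a timeout is already captured, which is the property the patched workflow relies on.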

ES container runs as uid 1000; some storage drivers (hostPath in local overlay, plain manual PV) don't honour `fsGroup: 1000`, leaving the mount root-owned. ES then crashes on boot with:

  java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
  Caused by: org.elasticsearch.ElasticsearchException: failed to bind service

Add a `fix-data-permissions` busybox init container that chowns the data dir to 1000:1000 before ES starts. Carried in base (so Kapsule inherits it harmlessly) and re-stated in the local overlay (since the overlay was previously setting `initContainers: []` to drop the sysctl init container).

Caught by CI run #25974962800: the diagnostic poller surfaced the actual ES stack trace, which the original `if: failure()`-only diag step would have missed (the run cancelled instead of failing cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…APP_FRONTEND)

Three pod-level crash-loops on the prior smoke run:

1. deces-backend exited with "BullMQ Worker: concurrency must be a finite number greater than 0" because BACKEND_JOB_CONCURRENCY and BACKEND_CHUNK_CONCURRENCY were unset. Added defaults (2/2) plus the rest of the env vars the image's index.js reads at module load (APP_DNS, APP_URL, BACKEND_LOG_TIMER, BACKEND_TMP_*, DISPOSABLE_MAIL, COMMUNES_JSON, DB_JSON, WIKIDATA_LINKS → /dev/null for non-fatal warnings on missing data files).
2. deces-backend ALSO has no Redis to talk to. Added a minimal Redis Deployment + Service in base (redis:7.2-alpine, 128MB maxmem, ephemeral). Wired REDIS_HOST=redis / REDIS_PORT=6379 into the backend env block. Service name 'redis' resolves intra-namespace.
3. deces-ui:latest (built 2026-04-26) ships an older nginx/run.sh that checks `APP` (current main checks `APP_FRONTEND`). Setting both env vars on the Deployment so the manifest works against either tag.

Workflow patched to wait for redis (2m) before ES (6m) before backend/ui (4m).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
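The commit fixes the first crash-loop by setting defaults in the manifest. A complementary guard on the application side could look like the sketch below, which falls back to a default whenever the variable is unset or not a positive integer. This is not the backend's actual code; `parseConcurrency` and `workerOptions` are hypothetical names, and only the variable names and the 2/2 defaults come from the commit message.

```typescript
// Returns the parsed value when `raw` is a positive integer, else `fallback`.
function parseConcurrency(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Builds worker settings from an env-like object (for example process.env).
function workerOptions(env: Record<string, string | undefined>) {
  return {
    jobConcurrency: parseConcurrency(env.BACKEND_JOB_CONCURRENCY, 2),
    chunkConcurrency: parseConcurrency(env.BACKEND_CHUNK_CONCURRENCY, 2),
    redis: {
      host: env.REDIS_HOST ?? "redis", // Service name from the base manifests
      port: Number(env.REDIS_PORT ?? "6379"),
    },
  };
}
```

Called with an empty environment, `workerOptions({})` yields a concurrency of 2 for both settings and `redis:6379` as the Redis endpoint, matching the values wired into the Deployment.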

The deces-ui nginx run.sh substitutes every `<VAR>` placeholder in the templates (nginx.conf.template + default.conf.template) with the value of the matching env var. Any unreplaced placeholder leaves invalid syntax like `$<API_USER_SCOPE>` in /etc/nginx/nginx.conf line 19, and nginx aborts with:

  [emerg] invalid variable name in /etc/nginx/nginx.conf:19

Adds deces-ui-nginx ConfigMap with defaults copied verbatim from packages/deces-ui/Makefile: API_USER_SCOPE, API_*_LIMIT_RATE, API_*_BURST, API_READ_TIMEOUT, API_SEND_TIMEOUT, API_MAX_BODY, NGINX_CSP, GOOGLE_ANALYTICS_ID, GOOGLE_ADSENSE_ID, DATAGOUV_*. deces-ui Deployment now `envFrom`-s it.

Caught by CI run #25975385301 where backend / redis / ES all reached Ready 1/1 but deces-ui crashed in nginx config validation. ES went from CrashLoopBackOff to Running thanks to the previous chown init container fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
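The failure mode described here can be modelled with a small placeholder renderer that also reports which variables were missing. The real run.sh is a shell script whose exact substitution logic is not shown in this PR, so the function below is only a sketch of the described behaviour, and the template string used in the usage note is made up.

```typescript
// Replaces every <VAR> placeholder with env[VAR] and collects the names that
// had no value. Unresolved placeholders are left in place, which is the state
// nginx later rejects as invalid syntax.
function renderTemplate(
  template: string,
  env: Record<string, string | undefined>,
): { rendered: string; missing: string[] } {
  const missing: string[] = [];
  const rendered = template.replace(
    /<([A-Z][A-Z0-9_]*)>/g,
    (placeholder: string, name: string) => {
      const value = env[name];
      if (value === undefined) {
        missing.push(name);
        return placeholder;
      }
      return value;
    },
  );
  return { rendered, missing };
}
```

For a hypothetical template line such as `client_max_body_size <API_MAX_BODY>; set $scope <API_USER_SCOPE>;` rendered with only `API_MAX_BODY` defined, `missing` contains `API_USER_SCOPE`. Failing fast on a non-empty `missing` list would surface the problem before nginx validates the config.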

Merge PR #55 as POC scaffolding only. Local k8s smoke is green; live Kapsule smoke remains track B and must follow the poc-k8s tenant contract.

Aligns the POC smoke workflow with the tenant kubeconfig contract from poc-k8s.

Allows the POC smoke to run on clusters without Traefik IngressRoute installed, using the existing port-forward fallback.

Uses the Scaleway-managed pool-name label for POC burst scheduling.

Lots 0-8 stay closed and checked. Replace the post-lot-8 backlog with lettered work packages: WP-A (k8s readiness), WP-B (Surch parity + k8s benchmarks), WP-C (adjacent tracks). Convention: status reports read done / to do / expected (fait / a faire / attendus), grouped by WP.

Replaces the in-process OTP store with a shared Redis store so authentication survives pod restarts and multi-pod runtime (k8s readiness, WP-A).
- New redisClient.ts: shared ioredis client reused outside BullMQ.
- mail.ts: OTP key is sha256(email) under otp:<digest> with 6h TTL. On Redis outage, validateOTP refuses with "Service temporairement indisponible" rather than bypassing or crashing.
- auth.controller.ts: validateOTP is now awaited; the controller deletes the Redis key after a successful match.
- mail.spec.ts: OTP coverage uses ioredis-mock; SMTP is mocked; disposable mail fixture is self-contained.
- package.json: add ioredis, ioredis-mock.
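The key scheme and the fail-closed behaviour described in this commit can be sketched as follows. This is not the code in mail.ts: the `OtpStore` interface is a hypothetical stand-in for the ioredis client, hex encoding of the digest is an assumption, and whether the email is normalized before hashing is not stated in the PR.

```typescript
import { createHash } from "node:crypto";

// 6h TTL, as stated in the commit message.
const OTP_TTL_SECONDS = 6 * 60 * 60;

// Builds the Redis key otp:<sha256(email)>; hex encoding is an assumption.
function otpKey(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `otp:${digest}`;
}

// Hypothetical minimal store interface standing in for the shared Redis client.
interface OtpStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
}

// Stores an OTP under the hashed key with the 6h TTL.
async function saveOtp(store: OtpStore, email: string, otp: string): Promise<void> {
  await store.set(otpKey(email), otp, OTP_TTL_SECONDS);
}

// Fails closed: a store error yields a refusal, never a bypass.
// Deleting the key after a match is left to the caller, as the commit
// message says the controller does.
async function validateOtp(
  store: OtpStore,
  email: string,
  candidate: string,
): Promise<{ ok: boolean; error?: string }> {
  try {
    const expected = await store.get(otpKey(email));
    return { ok: expected !== null && expected === candidate };
  } catch {
    return { ok: false, error: "Service temporairement indisponible" };
  }
}
```

Hashing the email keeps raw addresses out of Redis key names, and because every pod derives the same key from the same email, any replica can validate an OTP issued by another.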

Route remote CD SSH jobs through the existing proxy/bastion instead of relying on GitHub-hosted runner CIDR allowlists.

Add the bastion host key to known_hosts before CD jobs use ProxyJump for dataprep and deploy SSH commands.

Configure the CD runner SSH client so the ProxyJump bastion uses the deployed id_rsa_matchID identity explicitly.

Keep manual prod deploy from deleting the existing prod server, and refuse creating a third preserved prod server. CI passed on PR #64 after rerunning the checkout-only failure.

coderabbitai · 2026-05-23T12:36:11Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9902f20-9e3f-4372-89fd-46571319f25b


rhanka · 2026-05-23T12:36:50Z

Closing this dev->master PR because the dev HEAD contains [skip ci], so CI did not run for the release PR. Reopening via a dedicated release branch with a non-skip empty release marker commit to get normal PR checks before merge.
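The situation described in this comment, a head commit whose message suppresses the PR checks, can be detected up front. The sketch below checks a commit message against the skip markers documented for GitHub Actions; treating them as exact, case-sensitive substrings is an assumption, and both function names are hypothetical.

```typescript
// Commit-message markers documented by GitHub Actions as skipping
// push and pull_request workflow runs.
const SKIP_MARKERS = [
  "[skip ci]",
  "[ci skip]",
  "[no ci]",
  "[skip actions]",
  "[actions skip]",
];

// True when the commit message contains any skip marker.
function skipsCi(commitMessage: string): boolean {
  return SKIP_MARKERS.some((marker) => commitMessage.includes(marker));
}

// Hypothetical pre-flight check before opening a release PR: checks only
// run if the head commit of the source branch carries no skip marker.
function releaseHeadRunsChecks(headCommitMessage: string): boolean {
  return !skipsCi(headCommitMessage);
}
```

For example, `releaseHeadRunsChecks("fix(cd): preserve prod server on deploy [skip ci]")` is false, which is why a marker-free commit on a dedicated release branch is needed to get normal PR checks.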

rhanka added 30 commits September 8, 2025 20:21

remove usage of DC_PREFIX in root Makefile

957f721

move APP to APP_FRONTEND env var to clearly separate scopes of packages

4ccc84d

move ES_ vars to deces-infra (Makefile refacto)

281d22d

remove most duplicate BACKEND_ vars from root and deces_backend Makefile

b63d3cb

removed backend make targets from root and use include

997945a

docs: formalize monorepo migration plan

e51592a

docs: reframe migration plan around checkpoints

a73ffe6

docs: validate migration framing for lot 0

f23502f

sync tools datagouv checksum handling

8540905

sync deces-dataprep file selection regex

6c02c37

normalize deces-dataprep file regex for shell

ddb76b3

docs: mark completed lot 0 and lot 1 sync steps

41dd37b

sync deces-backend score and contact email

827488e

sync deces-backend otp mail handling

b41e6c6

sync deces-backend lockfile security updates

789177f

docs: mark deces-backend sync progress

9cdf6d9

docs: record deces-backend residuals and progress

03ff25d

docs: correct plan validation commands

e9b676d

docs: correct lot 1 residual tracking

71f0d1d

fix local config sudo policy

dd1de34

docs: refine lot 1 and lot 4 runtime plan

065bf56

reuse monorepo infra in deces-dataprep

f482d01

add canonical data version target

d228387

docs: mark lot 1 dataprep test progress

930e4b4

stabilize deces-dataprep dev startup

b9df53f

docs: record lot 1 validation evidence

8ed9ac4

docs: close lot 1 and add cleanup lot

be50926

fix backend make-only local validation

0cbd724

stabilize backend bulk test harness

a032cc1

stabilize backend vitest execution

b1e26f5

rhanka and others added 26 commits May 15, 2026 15:14

experiment(k8s): scaffold k3s-local + Kapsule poc overlays for matchID

b689c6f

fix(k8s): align poc smoke with tenant kubeconfig contract

b6a80ff

fix(k8s): align poc smoke with tenant kubeconfig contract

6425499

fix(k8s): make poc ingress optional

ebeb0c3

fix(k8s): make poc ingress optional

b6383c2

fix(k8s): target scaleway burst pool label

b03e598

fix(k8s): target scaleway burst pool label

f4c7e04

test(backend): satisfy OTP mail mock lint

9572f60

test(ui): align Playwright image version

b7fbcfb

fix(backend): allow initial OTP Redis command queue

8e96ce5

chore: add embedded graphify artifacts

7da6ee7

fix(cd): route remote SSH through bastion

12472bd

fix(cd): trust dataprep SSH bastion host key

4309909

fix(cd): configure SSH identity for bastion jump

97d09a6

fix(cd): preserve prod server on deploy

a9de143

fix(cd): preserve prod server on deploy [skip ci]

778491a

rhanka closed this May 23, 2026

rhanka deleted the dev branch May 23, 2026 20:26

rhanka wants to merge 256 commits into master from dev
