release: promote OTP and prod deploy guards#65

Closed
rhanka wants to merge 256 commits into master from dev

Member

@rhanka rhanka commented May 23, 2026

Summary

  • promote dev to master for the OTP backend release
  • include embedded graphify artifacts already merged in dev
  • include the prod deploy guards from #64 (fix(cd): preserve prod server on deploy), so a manual prod dispatch preserves the existing prod server and refuses to create a third preserved prod server

Release safety

  • merge commit must include [skip ci] to avoid triggering the automatic CD/dataprep-full run on push to master
  • prod deploy will be triggered separately with dataprep_scope=none and deploy_target=prod

Checks

  • dev PR checks are expected to run on this release PR before merge

rhanka and others added 26 commits May 15, 2026 15:14
Adds a `deploy/k8s/` tree to drive matchID on a local k3d cluster and
on the Scaleway Kapsule `poc` cluster (rhanka/poc-k8s). Not wired into
CI/CD yet; meant to be applied by hand for the burst-mode test sessions
described in the matchID onboarding intake against poc-k8s.

Files added:
- deploy/k8s/README.md           local k3d + poc cluster flows, known gaps
- deploy/k8s/Makefile            k3d-up/down, apply-local/poc, port-forward, logs, status
- deploy/k8s/base/               kustomize base
  * namespace.yaml               matchid Namespace (skipped in poc overlay)
  * deces-backend.deployment + service.yaml  matchid/deces-backend:latest, 8080
  * deces-ui.deployment + service.yaml       matchid/deces-ui:latest, 8083
  * elasticsearch.statefulset.yaml           ES 7.17.28, dev profile, JVM -Xmx512m
  * ingress.yaml                              Traefik IngressRoute (deces.local)
  * kustomization.yaml
- deploy/k8s/local/kustomization.yaml        alias overlay pointing at overlays/local/
- deploy/k8s/overlays/local/     k3d / k3s-local
  * NodePort services (UI 30083, backend 30080)
  * hostPath PV for ES (StorageClass matchid-local)
  * drops the privileged sysctl init container (k3s ships with vm.max_map_count high enough)
- deploy/k8s/overlays/poc/       Scaleway Kapsule poc
  * nodeSelector pool=burst + toleration on every workload
  * replicas: 0 at rest on all three workloads (burst-mode tenant)
  * scw-bssd PVC for ES, IngressRoute on matchid-poc.matchid.io with cert-manager TLS
  * deletes the base Namespace (poc-k8s owns it under tenants/matchid/)

Resource sizing matches the poc-k8s intake (PR request/matchid-onboarding):
  deces-backend  100m / 500m   + 256Mi / 512Mi
  deces-ui        50m / 200m   +  64Mi / 128Mi
  elasticsearch  250m / 1500m  + 512Mi / 1Gi

Validation:
  kubectl apply --dry-run=client --validate=false -k overlays/local/  OK
  kubectl apply --dry-run=client --validate=false -k overlays/poc/    OK

Caveats / not yet wired (documented in deploy/k8s/README.md):
- ES version drift: repo Makefiles pin 8.6.1 today; we ship 7.17.28
  here to stay under 1 GiB heap on the poc cluster. Reconciliation is
  part of the surch swap follow-up.
- Surch swap: long-term plan is to drop the ES StatefulSet and point
  deces-backend at the surch tenant's surch-api Service (blocked on
  the DSL inventory in EXPERIMENT_SURCH.md).
- cert-manager / letsencrypt-prod ClusterIssuer is referenced but
  provisioned out-of-band by poc-k8s.
- OIDC auth + SMTP secrets are envFrom: secretRef with optional: true;
  the Secret itself is provisioned out-of-tree.
- No .github/workflows/k8s-*.yml yet; CI/CD wiring is a follow-up
  referenced in the poc-k8s intake (request/matchid-onboarding).
- deces-dataprep (INSEE ingest Job) not manifested yet; read-path
  lands first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…veat

- Add "Environment tiers" section: CI=k3s/k3d, Dev=Kapsule poc, Prod=TBD
- Document local prereqs: Docker + kubectl + k3d + ≥15% free disk on /,
  with the diagnostic+fix when kubelet's DiskPressure taint hits.

Caught during a smoke run today: laptop at 99% on / put every Pod in
Pending with FailedScheduling pointing at DiskPressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e smoke

Adds .github/workflows/k8s-smoke.yml driving two paths:

- smoke-local (auto on push/PR with deploy/k8s/** changes): installs k3d
  inside the ubuntu-latest runner, brings up a single-node k3s cluster,
  applies overlays/local, waits for the Traefik CRDs and workload
  availability, then curls the deces-backend healthcheck and the UI
  through NodePort. Tears down at the end regardless of outcome.

- smoke-poc (workflow_dispatch only): pulls a kubeconfig for the
  Scaleway Kapsule `poc` cluster via the SCW CLI, applies overlays/poc,
  waits for availability, then curls the IngressRoute (Host:
  matchid-poc.matchid.io). Falls back to port-forward if the Traefik LB
  IP isn't ready yet.

Path-filter on deploy/k8s/** + workflow file to keep cost low.

New secrets needed (header comment in the workflow lists them); the
smoke-local path is the CI gate for this experimental branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive audit of architectural patterns breaking under Kubernetes:
- 8 P0 findings (in-memory state, file persistence, ES single-node)
- 6 P1 findings (sticky sessions, job timeouts, worker isolation)
- 4 P2 findings (logging, encryption, user DB)

Most critical:
1. OTP store in memory (mail.ts:60) - loses all OTPs on pod restart
2. IP-rate-limit maps (auth.ts:4-5) - routing to different pods bypasses bans
3. Job state arrays (processStream.ts:57-60) - lost on restart, breaks bulk

Effort: 6-8 weeks to K8s-ready (P0: 2w, P1: 2w, P2: 1w).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n diags on cancel

The first runs (#25974574306, #25974575275) cancelled at the 15min job
cap with no diagnostics because `if: failure()` doesn't fire on
cancellation. Patched workflow:

- Split "Wait for deployments" into ES-first then backend/UI so a slow
  ES doesn't eat the budget meant for backend.
- Background poller emits `kubectl get pods -o wide` + events every 30s
  during the wait, so the run logs always show why a pod is unhappy
  even when the parent step times out.
- Diagnostics now triggers on `failure() || cancelled()` so we capture
  state when the runner reaps the job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ES container runs as uid 1000; some storage drivers (hostPath in local
overlay, plain manual PV) don't honour `fsGroup: 1000`, leaving the
mount root-owned. ES then crashes on boot with:

  java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
  Caused by: org.elasticsearch.ElasticsearchException: failed to bind service

Add a `fix-data-permissions` busybox init container that chowns the
data dir to 1000:1000 before ES starts. Carried in base (so Kapsule
inherits it harmlessly) and re-stated in the local overlay (since the
overlay was previously setting `initContainers: []` to drop the sysctl
init container).

Caught by CI run #25974962800 — diagnostic poller surfaced the actual
ES stack trace which the original `if: failure()`-only diag step would
have missed (the run cancelled instead of failed cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…APP_FRONTEND)

Three pod-level crash-loops on the prior smoke run:

1. deces-backend exited with "BullMQ Worker: concurrency must be a
   finite number greater than 0" because BACKEND_JOB_CONCURRENCY and
   BACKEND_CHUNK_CONCURRENCY were unset. Added defaults (2/2) plus
   the rest of the env vars the image's index.js reads at module load
   (APP_DNS, APP_URL, BACKEND_LOG_TIMER, BACKEND_TMP_*, DISPOSABLE_MAIL,
   COMMUNES_JSON, DB_JSON, WIKIDATA_LINKS → /dev/null for non-fatal
   warnings on missing data files).

2. deces-backend ALSO has no Redis to talk to. Added a minimal Redis
   Deployment + Service in base (redis:7.2-alpine, 128MB maxmem,
   ephemeral). Wired REDIS_HOST=redis / REDIS_PORT=6379 into the
   backend env block. Service name 'redis' resolves intra-namespace.

3. deces-ui:latest (built 2026-04-26) ships an older nginx/run.sh that
   checks `APP` (current main checks `APP_FRONTEND`). Setting both env
   vars on the Deployment so the manifest works against either tag.

Workflow patched to wait for redis (2m) before ES (6m) before
backend/ui (4m).
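
The first crash-loop above comes from passing an unset env var straight into a worker's concurrency option. As a minimal sketch of a defensive parse (the helper name `parseConcurrency` and the integer-only rule are assumptions for illustration, not code from the image's index.js), something like this would fall back to the 2/2 defaults instead of crashing:

```typescript
// Hypothetical helper: turn a raw env value into a usable concurrency.
// Assumption: only positive integers are accepted; anything else
// (unset, empty, non-numeric, zero, negative, fractional) uses the fallback.
function parseConcurrency(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw); // Number(undefined) is NaN, Number("") is 0
  if (!Number.isInteger(parsed) || parsed <= 0) {
    return fallback;
  }
  return parsed;
}

// Example with a plain object standing in for the process environment.
const env: Record<string, string | undefined> = {
  BACKEND_JOB_CONCURRENCY: undefined, // unset, as in the failing smoke run
  BACKEND_CHUNK_CONCURRENCY: "4",
};

const jobConcurrency = parseConcurrency(env.BACKEND_JOB_CONCURRENCY, 2);
const chunkConcurrency = parseConcurrency(env.BACKEND_CHUNK_CONCURRENCY, 2);
console.log(jobConcurrency, chunkConcurrency);
```

The manifest defaults remain the actual fix in this commit; a parse like this would only be a second line of defence inside the backend.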

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deces-ui nginx run.sh substitutes every `<VAR>` placeholder in the
template (nginx.conf.template + default.conf.template) with the value
of the matching env var. Any unreplaced placeholder leaves invalid
syntax like `$<API_USER_SCOPE>` in /etc/nginx/nginx.conf line 19, and
nginx aborts with:

  [emerg] invalid variable name in /etc/nginx/nginx.conf:19

Adds deces-ui-nginx ConfigMap with defaults copied verbatim from
packages/deces-ui/Makefile: API_USER_SCOPE, API_*_LIMIT_RATE,
API_*_BURST, API_READ_TIMEOUT, API_SEND_TIMEOUT, API_MAX_BODY,
NGINX_CSP, GOOGLE_ANALYTICS_ID, GOOGLE_ADSENSE_ID, DATAGOUV_*.
deces-ui Deployment now `envFrom`-s it.
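
To illustrate the failure mode described above, here is a rough TypeScript model of `<VAR>` substitution that also reports placeholders left unreplaced. The real run.sh is a shell script; the function name `renderTemplate` and the placeholder regex are assumptions made for this sketch only.

```typescript
// Hypothetical model of the <VAR> substitution done at container start.
// Assumption: placeholders are upper-case names wrapped in angle brackets.
function renderTemplate(
  template: string,
  env: Record<string, string | undefined>
): { output: string; missing: string[] } {
  const missing: string[] = [];
  const output = template.replace(
    /<([A-Z][A-Z0-9_]*)>/g,
    (match: string, name: string): string => {
      const value = env[name];
      if (value === undefined) {
        // Leave the placeholder in place and remember it for reporting.
        if (missing.indexOf(name) === -1) {
          missing.push(name);
        }
        return match;
      }
      return value;
    }
  );
  return { output, missing };
}

// API_USER_SCOPE is deliberately unset to reproduce the invalid syntax.
const rendered = renderTemplate(
  "set $scope $<API_USER_SCOPE>; client_max_body_size <API_MAX_BODY>;",
  { API_MAX_BODY: "10m" }
);
console.log(rendered.output);
console.log(rendered.missing);
```

A check of this kind, run before nginx starts, would name the missing variable directly instead of surfacing it as an `[emerg] invalid variable name` error.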

Caught by CI run #25975385301 where backend / redis / ES all reached
Ready 1/1 but deces-ui crashed in nginx config validation. ES went
from CrashLoopBackOff to Running thanks to the previous chown init
container fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge PR #55 as POC scaffolding only. Local k8s smoke is green; live Kapsule smoke remains track B and must follow the poc-k8s tenant contract.
Aligns the POC smoke workflow with the tenant kubeconfig contract from poc-k8s.
Allows the POC smoke to run on clusters without Traefik IngressRoute installed, using the existing port-forward fallback.
Uses the Scaleway-managed pool-name label for POC burst scheduling.
Lots 0-8 stay closed and checked. Replace the post-lot-8 backlog with lettered work packages: WP-A (k8s readiness), WP-B (Surch parity + benches k8s), WP-C (adjacent tracks). Convention: status reports read fait / a faire / attendus (done / to do / expected), grouped by WP.
Replaces the in-process memory OTP store with a shared Redis store so authentication survives pod restarts and works in a multi-pod runtime (k8s readiness, WP-A).

- New redisClient.ts: shared ioredis client reused outside BullMQ.
- mail.ts: OTP key is sha256(email) under otp:<digest> with 6h TTL.
  On Redis outage, validateOTP refuses with "Service temporairement indisponible" ("service temporarily unavailable") rather than bypassing or crashing.
- auth.controller.ts: validateOTP is now awaited; the controller deletes the Redis key after a successful match.
- mail.spec.ts: OTP coverage uses ioredis-mock; SMTP is mocked; disposable mail fixture is self-contained.
- package.json: add ioredis, ioredis-mock.
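
The key scheme and fail-closed behaviour listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in mail.ts: the hex digest encoding, the `OtpStore` interface, and the function names `otpKey` and `validateOtp` are hypothetical, and whether the email is normalized before hashing is not stated in the commit.

```typescript
import { createHash } from "crypto";

// 6h TTL from the commit message, in seconds. With ioredis this would
// typically be passed as set(key, value, "EX", OTP_TTL_SECONDS).
const OTP_TTL_SECONDS = 6 * 60 * 60;

// Assumption: the digest is hex-encoded and the email is hashed as given.
function otpKey(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `otp:${digest}`;
}

// Hypothetical minimal store interface; an ioredis client's get()
// has this shape.
interface OtpStore {
  get(key: string): Promise<string | null>;
}

type OtpResult = { valid: boolean; error?: string };

// Fail closed: a store error refuses the login instead of bypassing
// the check or throwing.
async function validateOtp(
  store: OtpStore,
  email: string,
  candidate: string
): Promise<OtpResult> {
  let stored: string | null;
  try {
    stored = await store.get(otpKey(email));
  } catch {
    return { valid: false, error: "Service temporairement indisponible" };
  }
  return { valid: stored !== null && stored === candidate };
}

console.log(otpKey("user@example.org"), OTP_TTL_SECONDS);
```

Hashing the email keeps addresses out of Redis key names. Deleting the key after a successful match stays in the controller, as the commit describes, so it is left out of this sketch.
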
Route remote CD SSH jobs through the existing proxy/bastion instead of relying on GitHub-hosted runner CIDR allowlists.
Add the bastion host key to known_hosts before CD jobs use ProxyJump for dataprep and deploy SSH commands.
Configure the CD runner SSH client so the ProxyJump bastion uses the deployed id_rsa_matchID identity explicitly.
Keep manual prod deploy from deleting the existing prod server, and refuse creating a third preserved prod server. CI passed on PR #64 after rerunning the checkout-only failure.
coderabbitai Bot commented May 23, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9902f20-9e3f-4372-89fd-46571319f25b

Member Author

rhanka commented May 23, 2026

Closing this dev->master PR because the dev HEAD contains [skip ci], so CI did not run for the release PR. Reopening via a dedicated release branch with a non-skip empty release marker commit to get normal PR checks before merge.
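
For context, GitHub Actions skips `push` and `pull_request` workflow runs when the head commit message contains one of a small set of markers. A minimal sketch of a pre-flight check for a release branch head is below; the function name `hasSkipMarker` is hypothetical, and the plain case-sensitive substring match is an assumption about how to approximate GitHub's behaviour.

```typescript
// Markers documented by GitHub Actions for skipping workflow runs.
const SKIP_MARKERS: string[] = [
  "[skip ci]",
  "[ci skip]",
  "[no ci]",
  "[skip actions]",
  "[actions skip]",
];

// Hypothetical helper: true when the commit message would suppress CI.
// Assumption: a case-sensitive substring match is close enough here.
function hasSkipMarker(commitMessage: string): boolean {
  return SKIP_MARKERS.some(
    (marker: string) => commitMessage.indexOf(marker) !== -1
  );
}

console.log(hasSkipMarker("release: promote dev [skip ci]"));
console.log(hasSkipMarker("release: empty release marker"));
```

Running a check like this against the head commit before opening a release PR would flag the situation described above, where the PR checks never started.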

@rhanka rhanka closed this May 23, 2026
@rhanka rhanka deleted the dev branch May 23, 2026 20:26