Adds a `deploy/k8s/` tree to drive matchID on a local k3d cluster and on the Scaleway Kapsule `poc` cluster (rhanka/poc-k8s). Not wired into CI/CD yet; meant to be applied by hand for the burst-mode test sessions described in the matchID onboarding intake against poc-k8s.

Files added:
- deploy/k8s/README.md: local k3d + poc cluster flows, known gaps
- deploy/k8s/Makefile: k3d-up/down, apply-local/poc, port-forward, logs, status
- deploy/k8s/base/: kustomize base
  * namespace.yaml: matchid Namespace (skipped in poc overlay)
  * deces-backend.deployment + service.yaml: matchid/deces-backend:latest, 8080
  * deces-ui.deployment + service.yaml: matchid/deces-ui:latest, 8083
  * elasticsearch.statefulset.yaml: ES 7.17.28, dev profile, JVM -Xmx512m
  * ingress.yaml: Traefik IngressRoute (deces.local)
  * kustomization.yaml
- deploy/k8s/local/kustomization.yaml: alias overlay pointing at overlays/local/
- deploy/k8s/overlays/local/: k3d / k3s-local
  * NodePort services (UI 30083, backend 30080)
  * hostPath PV for ES (StorageClass matchid-local)
  * drops the privileged sysctl init container (k3s ships with vm.max_map_count high enough)
- deploy/k8s/overlays/poc/: Scaleway Kapsule poc
  * nodeSelector pool=burst + toleration on every workload
  * replicas: 0 at rest on all three workloads (burst-mode tenant)
  * scw-bssd PVC for ES, IngressRoute on matchid-poc.matchid.io with cert-manager TLS
  * deletes the base Namespace (poc-k8s owns it under tenants/matchid/)

Resource sizing matches the poc-k8s intake (PR request/matchid-onboarding):
  deces-backend   100m / 500m  + 256Mi / 512Mi
  deces-ui        50m / 200m   + 64Mi / 128Mi
  elasticsearch   250m / 1500m + 512Mi / 1Gi

Validation:
  kubectl apply --dry-run=client --validate=false -k overlays/local/   OK
  kubectl apply --dry-run=client --validate=false -k overlays/poc/     OK

Caveats / not yet wired (documented in deploy/k8s/README.md):
- ES version drift: repo Makefiles pin 8.6.1 today; we ship 7.17.28 here to stay under 1 GiB heap on the poc cluster. Reconciliation is part of the surch swap follow-up.
- Surch swap: long-term plan is to drop the ES StatefulSet and point deces-backend at the surch tenant's surch-api Service (blocked on the DSL inventory in EXPERIMENT_SURCH.md).
- cert-manager / letsencrypt-prod ClusterIssuer is referenced but provisioned out-of-band by poc-k8s.
- OIDC auth + SMTP secrets are envFrom: secretRef with optional: true; the Secret itself is provisioned out-of-tree.
- No .github/workflows/k8s-*.yml yet; CI/CD wiring is a follow-up referenced in the poc-k8s intake (request/matchid-onboarding).
- deces-dataprep (INSEE ingest Job) not manifested yet; read-path lands first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…veat

- Add "Environment tiers" section: CI=k3s/k3d, Dev=Kapsule poc, Prod=TBD
- Document local prereqs: Docker + kubectl + k3d + ≥15% free disk on /, with the diagnostic+fix when kubelet's DiskPressure taint hits.

Caught during a smoke run today: laptop at 99% on / put every Pod in Pending with FailedScheduling pointing at DiskPressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e smoke

Adds .github/workflows/k8s-smoke.yml driving two paths:
- smoke-local (auto on push/PR with deploy/k8s/** changes): installs k3d inside the ubuntu-latest runner, brings up a single-node k3s cluster, applies overlays/local, waits for the Traefik CRDs and workload availability, then curls the deces-backend healthcheck + UI through NodePort. Tears down at the end regardless of outcome.
- smoke-poc (workflow_dispatch only): pulls a kubeconfig for the Scaleway Kapsule `poc` cluster via the SCW CLI, applies overlays/poc, waits for availability, then curls the IngressRoute (Host: matchid-poc.matchid.io). Falls back to port-forward if the Traefik LB IP isn't ready yet.

Path-filter on deploy/k8s/** + workflow file to keep cost low. New secrets needed (header comment in the workflow lists them); the smoke-local path is the CI gate for this experimental branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comprehensive audit of architectural patterns breaking under Kubernetes:
- 8 P0 findings (in-memory state, file persistence, ES single-node)
- 6 P1 findings (sticky sessions, job timeouts, worker isolation)
- 4 P2 findings (logging, encryption, user DB)

Most critical:
1. OTP store in memory (mail.ts:60) - loses all OTPs on pod restart
2. IP-rate-limit maps (auth.ts:4-5) - routing to different pods bypasses bans
3. Job state arrays (processStream.ts:57-60) - lost on restart, breaks bulk

Effort: 6-8 weeks to K8s-ready (P0: 2w, P1: 2w, P2: 1w).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
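To make finding 2 concrete, here is a minimal sketch of why per-pod counters fail once requests are spread across replicas. It is an illustration under assumptions, not the code in auth.ts: `CounterStore`, `MapStore`, `Pod`, `simulate` and the threshold of 3 are all hypothetical names and values, and `MapStore` stands in for whatever shared store (for example Redis) would back the real fix.

```typescript
// Hypothetical illustration; none of these names come from auth.ts.
interface CounterStore {
  incr(key: string): number;
}

// In-memory counter. One instance per pod models the current situation;
// one instance shared by all pods models an external store such as Redis.
class MapStore implements CounterStore {
  private counts = new Map<string, number>();

  incr(key: string): number {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

const MAX_ATTEMPTS = 3; // assumed threshold, for illustration only

class Pod {
  private store: CounterStore;

  constructor(store: CounterStore) {
    this.store = store;
  }

  // Returns true while the caller is still under the attempt limit.
  handle(ip: string): boolean {
    return this.store.incr(`attempts:${ip}`) <= MAX_ATTEMPTS;
  }
}

// Round-robin the requests across pods, as a Service would roughly do.
function simulate(pods: Pod[], requests: number, ip: string): number {
  let allowed = 0;
  for (let i = 0; i < requests; i++) {
    if (pods[i % pods.length].handle(ip)) allowed++;
  }
  return allowed;
}

// Two pods, each with its own map: every pod sees only half the traffic.
const allowedPerPod = simulate(
  [new Pod(new MapStore()), new Pod(new MapStore())],
  6,
  "203.0.113.7",
);

// Two pods sharing one store: the limit applies to the total.
const sharedStore = new MapStore();
const allowedShared = simulate(
  [new Pod(sharedStore), new Pod(sharedStore)],
  6,
  "203.0.113.7",
);

console.log({ allowedPerPod, allowedShared });
```

With separate maps all six requests pass, because each pod only counts three; with the shared store only the first three do.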
…n diags on cancel

The first runs (#25974574306, #25974575275) cancelled at the 15min job cap with no diagnostics because `if: failure()` doesn't fire on cancellation.

Patched workflow:
- Split "Wait for deployments" into ES-first then backend/UI so a slow ES doesn't eat the budget meant for backend.
- Background poller emits `kubectl get pods -o wide` + events every 30s during the wait, so the run logs always show why a pod is unhappy even when the parent step times out.
- Diagnostics now triggers on `failure() || cancelled()` so we capture state when the runner reaps the job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ES container runs as uid 1000; some storage drivers (hostPath in local overlay, plain manual PV) don't honour `fsGroup: 1000`, leaving the mount root-owned. ES then crashes on boot with:

  java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
  Caused by: org.elasticsearch.ElasticsearchException: failed to bind service

Add a `fix-data-permissions` busybox init container that chowns the data dir to 1000:1000 before ES starts. Carried in base (so Kapsule inherits it harmlessly) and re-stated in the local overlay (since the overlay was previously setting `initContainers: []` to drop the sysctl init container).

Caught by CI run #25974962800: the diagnostic poller surfaced the actual ES stack trace, which the original `if: failure()`-only diag step would have missed (the run cancelled instead of failing cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…APP_FRONTEND)

Three pod-level crash-loops on the prior smoke run:

1. deces-backend exited with "BullMQ Worker: concurrency must be a finite number greater than 0" because BACKEND_JOB_CONCURRENCY and BACKEND_CHUNK_CONCURRENCY were unset. Added defaults (2/2) plus the rest of the env vars the image's index.js reads at module load (APP_DNS, APP_URL, BACKEND_LOG_TIMER, BACKEND_TMP_*, DISPOSABLE_MAIL, COMMUNES_JSON, DB_JSON, WIKIDATA_LINKS → /dev/null for non-fatal warnings on missing data files).
2. deces-backend ALSO has no Redis to talk to. Added a minimal Redis Deployment + Service in base (redis:7.2-alpine, 128MB maxmem, ephemeral). Wired REDIS_HOST=redis / REDIS_PORT=6379 into the backend env block. Service name 'redis' resolves intra-namespace.
3. deces-ui:latest (built 2026-04-26) ships an older nginx/run.sh that checks `APP` (current main checks `APP_FRONTEND`). Setting both env vars on the Deployment so the manifest works against either tag.

Workflow patched to wait for redis (2m) before ES (6m) before backend/ui (4m).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
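For context on item 1: `Number(undefined)` evaluates to `NaN`, which is presumably how an unset variable turned into a non-finite concurrency. A defensive reader along the following lines would give the same 2/2 defaults in code rather than only in the manifest. This is a sketch under assumptions: `readConcurrency` is a hypothetical helper, and the way the image's index.js actually parses these variables is not shown in this PR.

```typescript
// Hypothetical helper; the real index.js may read these variables differently.
function readConcurrency(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  // Unset or blank: use the default instead of passing NaN downstream.
  if (raw === undefined || raw.trim() === "") return fallback;

  const parsed = Number(raw);
  // Reject anything that is not a positive integer, with a message naming the variable.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

// Example with an explicit env object; in the backend this would be process.env.
const exampleEnv: Record<string, string | undefined> = {
  BACKEND_CHUNK_CONCURRENCY: "4",
};
const jobConcurrency = readConcurrency(exampleEnv, "BACKEND_JOB_CONCURRENCY", 2);
const chunkConcurrency = readConcurrency(exampleEnv, "BACKEND_CHUNK_CONCURRENCY", 2);

console.log({ jobConcurrency, chunkConcurrency });
```

Failing fast with the variable name in the message would also have made the original crash-loop easier to read in `kubectl logs`.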
The deces-ui nginx run.sh substitutes every `<VAR>` placeholder in the template (nginx.conf.template + default.conf.template) with the value of the matching env var. Any unreplaced placeholder leaves invalid syntax like `$<API_USER_SCOPE>` in /etc/nginx/nginx.conf line 19, and nginx aborts with:

  [emerg] invalid variable name in /etc/nginx/nginx.conf:19

Adds deces-ui-nginx ConfigMap with defaults copied verbatim from packages/deces-ui/Makefile: API_USER_SCOPE, API_*_LIMIT_RATE, API_*_BURST, API_READ_TIMEOUT, API_SEND_TIMEOUT, API_MAX_BODY, NGINX_CSP, GOOGLE_ANALYTICS_ID, GOOGLE_ADSENSE_ID, DATAGOUV_*. deces-ui Deployment now `envFrom`-s it.

Caught by CI run #25975385301 where backend / redis / ES all reached Ready 1/1 but deces-ui crashed in nginx config validation. ES went from CrashLoopBackOff to Running thanks to the previous chown init container fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
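The failure mode can be reproduced outside nginx with a small sketch of the substitution step. This is not run.sh itself (which is a shell script); `renderTemplate`, the placeholder regex, the sample template line and the `API_USER_LIMIT_RATE` name and value are assumptions made for illustration only.

```typescript
// Assumed placeholder syntax: <NAME> made of upper-case letters, digits and underscores.
const PLACEHOLDER = /<([A-Z][A-Z0-9_]*)>/g;

interface RenderResult {
  output: string;
  missing: string[]; // placeholder names that had no matching env var
}

// Replace each <NAME> with env[NAME]; leave unknown placeholders untouched and report them.
function renderTemplate(
  template: string,
  env: Record<string, string | undefined>,
): RenderResult {
  const missing: string[] = [];
  const output = template.replace(PLACEHOLDER, (match: string, name: string) => {
    const value = env[name];
    if (value === undefined) {
      if (!missing.includes(name)) missing.push(name);
      return match;
    }
    return value;
  });
  return { output, missing };
}

// Illustrative template line, not copied from nginx.conf.template.
const sampleTemplate =
  "limit_req_zone $<API_USER_SCOPE> zone=api:10m rate=<API_USER_LIMIT_RATE>;";

// Only one of the two variables is set, as in the failing CI run.
const rendered = renderTemplate(sampleTemplate, { API_USER_LIMIT_RATE: "1r/s" });

console.log(rendered.output);
console.log(rendered.missing);
```

A non-empty `missing` list is the condition that ends in `[emerg] invalid variable name`; checking for it before starting nginx would turn the crash into a message naming the missing variable.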
Merge PR #55 as POC scaffolding only. Local k8s smoke is green; live Kapsule smoke remains track B and must follow the poc-k8s tenant contract.
Aligns the POC smoke workflow with the tenant kubeconfig contract from poc-k8s.
Allows the POC smoke to run on clusters without Traefik IngressRoute installed, using the existing port-forward fallback.
Uses the Scaleway-managed pool-name label for POC burst scheduling.
Lots 0-8 stay closed and checked. Replace the post-lot-8 backlog with lettered work packages: WP-A (k8s readiness), WP-B (Surch parity + benches k8s), WP-C (adjacent tracks). Convention: status reports read fait / à faire / attendus (done / to do / expected), grouped by WP.
Replaces the process-memory OTP store with a shared Redis store so authentication survives pod restarts and multi-pod runtime (k8s readiness, WP-A).

- New redisClient.ts: shared ioredis client reused outside BullMQ.
- mail.ts: OTP key is sha256(email) under otp:<digest> with 6h TTL. On Redis outage, validateOTP refuses with "Service temporairement indisponible" rather than bypassing or crashing.
- auth.controller.ts: validateOTP is now awaited; the controller deletes the Redis key after a successful match.
- mail.spec.ts: OTP coverage uses ioredis-mock; SMTP is mocked; disposable mail fixture is self-contained.
- package.json: add ioredis, ioredis-mock.
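The key scheme and the fail-closed behaviour described above could look roughly like the sketch below. It is written against a minimal `KeyValueStore` interface rather than the real ioredis client, so `KeyValueStore`, `MemoryStore`, `storeOTP`, `OtpResult` and the function signatures are assumptions, not the contents of mail.ts. For brevity the key deletion is folded into `validateOTP`, whereas the commit says the controller performs it.

```typescript
import { createHash } from "node:crypto";

// Minimal stand-in for a Redis-like client; NOT the ioredis API.
interface KeyValueStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
  del(key: string): Promise<void>;
}

const OTP_TTL_SECONDS = 6 * 60 * 60; // 6h, as stated in the commit message

// otp:<sha256(email)> keeps raw addresses out of the key space.
function otpKey(email: string): string {
  const digest = createHash("sha256").update(email).digest("hex");
  return `otp:${digest}`;
}

async function storeOTP(store: KeyValueStore, email: string, otp: string): Promise<void> {
  await store.set(otpKey(email), otp, OTP_TTL_SECONDS);
}

// Hypothetical result shape.
interface OtpResult {
  valid: boolean;
  msg?: string;
}

async function validateOTP(store: KeyValueStore, email: string, otp: string): Promise<OtpResult> {
  let stored: string | null;
  try {
    stored = await store.get(otpKey(email));
  } catch {
    // Store outage: refuse rather than bypass the check or crash.
    return { valid: false, msg: "Service temporairement indisponible" };
  }
  if (stored === null || stored !== otp) return { valid: false };
  // One-time use: drop the key after a successful match (done by the controller in the PR).
  await store.del(otpKey(email));
  return { valid: true };
}

// In-memory store with expiry, used here only to exercise the functions.
class MemoryStore implements KeyValueStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (entry === undefined || entry.expiresAt <= Date.now()) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  async del(key: string): Promise<void> {
    this.entries.delete(key);
  }
}
```

Because every pod derives the same key from the same email, an OTP issued by one replica can be validated by another, which is the property the in-memory store lacked.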
Route remote CD SSH jobs through the existing proxy/bastion instead of relying on GitHub-hosted runner CIDR allowlists.
Add the bastion host key to known_hosts before CD jobs use ProxyJump for dataprep and deploy SSH commands.
Configure the CD runner SSH client so the ProxyJump bastion uses the deployed id_rsa_matchID identity explicitly.
Keep manual prod deploy from deleting the existing prod server, and refuse creating a third preserved prod server. CI passed on PR #64 after rerunning the checkout-only failure.
Closing this dev->master PR because the dev HEAD contains [skip ci], so CI did not run for the release PR. Reopening via a dedicated release branch with a non-skip empty release marker commit to get normal PR checks before merge.
Summary
Release safety
Checks