Skip to content

Develop: v0.1.5#38

Merged
0xmanhnv merged 24 commits into
mainfrom
develop
Apr 9, 2026
Merged

Develop: v0.1.5#38
0xmanhnv merged 24 commits into
mainfrom
develop

Conversation

@0xmanhnv
Copy link
Copy Markdown
Collaborator

@0xmanhnv 0xmanhnv commented Apr 9, 2026

No description provided.

0xmanhnv added 22 commits April 8, 2026 07:30
…imezone bug

Phase 1.1 + 1.2 of scan feature improvements:

Migration 000102:
- Add scans.agent_preference column (was in domain entity but missing from DB)
- Add scans.profile_id column linking to scan_profiles
- Add scans.timeout_seconds column (default 3600, max 86400)
- Add FK scans.last_run_id -> pipeline_runs (referential integrity)
- Add FK pipeline_runs.scan_id -> scans (referential integrity)

Domain:
- Add Scan.ProfileID and Scan.TimeoutSeconds fields
- Add SetProfileID() and SetTimeoutSeconds() methods
- Fix calculateNextRun() to honor ScheduleTimezone (was using local time)
- Add proper time-of-day support for daily/weekly/monthly schedules
- Add DefaultScanTimeoutSeconds (3600) and MaxScanTimeoutSeconds (86400) constants

Repository:
- Consolidate scanFromRow/scanFromRows into shared readScan helper
- Add new columns to INSERT, UPDATE, and SELECT queries
- Add WithProfileRepo service option for cross-tenant profile validation

Service:
- CreateScanInput accepts profile_id, timeout_seconds
- Validate scan profile exists and belongs to tenant before linking
- System profiles (no tenant) allowed for everyone

Handler:
- CreateScanRequest exposes agent_preference, profile_id, timeout_seconds
- Wire request fields through to service input

CRITICAL: Run migration 000102 before deploying this binary, otherwise
SELECT queries will fail with "column does not exist" errors.
Phase 1.5 + 1.6 of scan feature improvements:

Phase 1.5 — Scan timeout enforcement:
- Add MarkTimedOutRuns method on pipeline.RunRepository
- Single SQL UPDATE joins pipeline_runs and scans, marks runs as timeout
  when EXTRACT(EPOCH FROM (NOW() - started_at)) > scan.timeout_seconds
- New ScanTimeoutController runs every 60s (registered alongside JobRecovery)
- Uses scan's per-config timeout instead of one-size-fits-all global threshold

Phase 1.6 — Distributed scheduler lock:
- Add TryLockScanForScheduler/UnlockScanForScheduler on scan.Repository
- Uses pg_try_advisory_lock(int4, int4) with namespaced key (SCAN namespace)
- Per-scan locking — multiple API replicas can process different scans in parallel
- Scheduler acquires lock before triggering, releases via defer
- Skips scan with debug log if already locked by another instance
- FNV-1a 32-bit hash converts UUID to lock key (rare collisions only cause
  brief serialization, not correctness issues)

Tests:
- Update 3 mock implementations with new repository methods
- All unit + integration tests pass
Phase 1.8 — Plug all the holes from Phase 1:

API surface fixes (the audit found these were missing):
- ScanDetailResponse now exposes agent_preference, profile_id, timeout_seconds
  (clients can finally see fields they POSTed)
- toScanResponse populates the new fields
- UpdateScanRequest/UpdateScanInput accept profile_id, timeout_seconds,
  agent_preference, max_retries, retry_backoff_seconds (was create-only)
- UpdateScan service properly handles all the new fields with sentinel
  semantics for profile_id (empty string = unlink)
- ScanConfigExport/ImportConfig handle profile_id and timeout_seconds
  (export-import roundtrip no longer loses data)
- Frontend ScanConfig type, CreateScanConfigRequest, UpdateScanConfigRequest
  expose the new fields

Security hardening (from cybersecurity audit):
- Add MinScanTimeoutSeconds = 30 constant + domain-layer enforcement
  in SetTimeoutSeconds (defense-in-depth — bypass HTTP validation
  still gets the floor)
- Validation tags now require min=30 (was min=1, allowed DoS via 1s timeouts)
- New scanprofile.Repository.GetAccessibleByID enforces tenant scope at SQL
  layer using `WHERE id = ? AND (tenant_id = ? OR tenant_id IS NULL)` —
  app-layer check is replaced with SQL-layer check (audit said "rely on SQL,
  not application logic")
- Scan service uses GetAccessibleByID for both Create and Update profile
  linking — system profiles still allowed, cross-tenant access returns
  ErrNotFound (no info leak via error messages)

Migration 103 (DB performance + Phase 2.3 schema):
- Add idx_scans_last_run for FK referential integrity check on
  pipeline_run delete (audit said missing index causes table scan)
- Add partial idx_pipeline_runs_pending_started for ScanTimeoutController
  sweeper query
- Drop unused idx_scans_profile (no query filters by profile_id)
- Add max_retries (0-10) and retry_backoff_seconds (10-86400) columns
- Add retry_attempt column on pipeline_runs

Phase 2.1 — Quality gate evaluation now actually works:
- Trigger paths (workflow + single scan) propagate sc.ProfileID to
  run.ScanProfileID. The pipeline service already evaluates the quality
  gate from this field on run completion — it just never had a value
  before because the link wasn't wired.
- End-to-end: scan has profile_id → run inherits ScanProfileID → quality
  gate evaluated when run completes → result stored in
  pipeline_runs.quality_gate_result

Phase 2.2 — Job cancellation:
- Add command.Repository.CancelByPipelineRunID — single SQL UPDATE that
  cancels all non-terminal commands belonging to a pipeline run via
  step_runs.pipeline_run_id JOIN, scoped by tenant
- pipeline.Service.CancelRun now calls CancelByPipelineRunID after
  marking the run as canceled, so in-flight commands are aborted instead
  of waiting for JobRecoveryController to time them out (10-30 min)

Phase 2.3 — Retry logic with exponential backoff:
- Domain: Scan.MaxRetries, Scan.RetryBackoffSeconds, SetRetryConfig,
  CalculateRetryDelay (exponential), ShouldRetry
- Domain: Run.RetryAttempt tracks the current attempt (0 = first)
- Repository: pipeline.RunRepository.ListPendingRetries — single query
  using DISTINCT ON + JOIN to find latest failed run per scan and check
  exponential backoff against retry_backoff_seconds * 2^retry_attempt
- New ScanRetryController polls every 60s, dispatches retries via the
  RetryDispatcher interface. Scan service implements RetryDispatcher
  via RetryScanRun method.
- New run gets retry_attempt = previous + 1, eventually stopping at
  max_retries

Repository refactoring:
- Scan repository handles all new columns in INSERT/UPDATE/SELECT
- Pipeline_run repository handles retry_attempt in INSERT/SELECT (both
  Create and CreateRunIfUnderLimit paths)

Tests:
- Update mocks for ListPendingRetries, CancelByPipelineRunID,
  GetAccessibleByID. All unit + integration tests pass.

CRITICAL: Run migration 102 AND 103 before/with this binary deploy.
Migration 102 adds the columns the binary references in SELECT queries.
Covers all the new functionality from Phase 1.x and Phase 2.x:

- Section 1-5: Setup (auth, evil tenant, asset, group, profile)
- Section 6: Phase 1.1+1.2 — create scan with new fields
- Section 7: Phase 1.8d — min timeout 30s + max 86400 validation
- Section 8: Phase 1.8d — cross-tenant profile rejection (security)
- Section 9: Phase 1.8b — UpdateScan handles profile_id (link/unlink),
  agent_preference, timeout_seconds, max_retries, retry_backoff_seconds
  with sentinel semantics and bound validation
- Section 10: Phase 1.8c — export/import roundtrip preserves new fields
- Section 11: Phase 2.1 — quality gate evaluation wiring (trigger)
- Section 12: Phase 1.3 — findings drill-down via /findings?scan_id=
- Section 13: Phase 2.2 — cancel run cascades to commands
- Section 14: tenant isolation on GET/UPDATE/DELETE (cross-tenant)
- Section 15: timezone validation, agent_preference enum, edge cases
- Section 16: stats and listing
- Section 17: cleanup

The script uses two tenants (primary + evil) to verify cross-tenant
isolation. Each test asserts both happy-path and security-path behavior.
User feedback: triggering a paused scan returned "scan is not validate" — confusing.

Changes:
- trigger.go: replace generic "scan is not active" with state-specific
  messages explaining WHY and HOW to fix:
    paused   -> "Cannot trigger '<name>' because it is paused. Resume the
                 scan first to trigger it."
    disabled -> "Cannot trigger '<name>' because it is disabled. Activate
                 the scan first to trigger it."
  Returns DomainError with code SCAN_NOT_TRIGGERABLE for client-side handling.

- scan_handler.go: introduce cleanErrorMessage helper that extracts the
  user-facing .Message field from shared.DomainError instead of dumping
  raw err.Error() (which prepends the code and appends ": validation").
  All ValidationError, AlreadyExists, etc. responses now show clean text.
Critical fixes from the multi-role audit:

PM/TechLead audit:
- GAP 1: ScanConfigExport missing max_retries/retry_backoff_seconds.
  Export-import roundtrip would silently lose retry config.
  Added fields to struct + ExportConfig + ImportConfig wiring.

- GAP 4: pipeline_run_repository.Update() missing retry_attempt in SET clause.
  After RetryScanRun set run.RetryAttempt and called Update(), the value
  was silently dropped — retry counter would always read 0 from DB.
  Added retry_attempt to UPDATE SET + parameter list.

- GAP 6: No tests for retry domain logic. Added scan_retry_test.go covering:
  * SetRetryConfig: defaults, valid, negative, over-max, zero/min/max backoff,
    boundary values, UpdatedAt mutation
  * CalculateRetryDelay: 1st/2nd/3rd retry, exponential growth, cap at max
  * ShouldRetry: max=0 never retries, attempts 0..max allowed, max+1 denied
  * Clone: preserves retry config, independent from original

Security audit:
- GAP 5: ScanRetryController could dispatch duplicate retries on crash/restart
  or with multiple controller instances.
  Fixed via:
  * New retry_dispatched_at column on pipeline_runs (migration 103)
  * ListPendingRetries now uses CTE + FOR UPDATE SKIP LOCKED + atomic UPDATE
    to set retry_dispatched_at = NOW(). The candidate is "claimed" before
    being returned, preventing concurrent dispatch.
  * Eligibility filter excludes runs where retry_dispatched_at IS NOT NULL.
  * Index idx_pipeline_runs_retry_pending optimizes the lookup.

User-reported UX:
- Trigger error "scan is not validate" was unhelpful.
  Fixed in commit 9d64188: state-specific messages explaining WHY (paused/
  disabled) and HOW to fix (resume/activate). cleanErrorMessage helper
  in handler strips DomainError code prefix and wrapped error suffix.

Tests:
- All unit tests pass (added 4 retry test groups, 24 test cases)
…verride

Solves the chicken-and-egg problem: new deployments without SMTP couldn't
register users because verification emails couldn't be sent, but verification
was required to log in.

Resolution order (most specific wins):
  1. Per-tenant SecuritySettings.EmailVerificationMode
     - "always" → require even if SMTP missing (operator must configure SMTP)
     - "never"  → skip verification (closed/internal deployments)
     - "auto"   → smart detection (default)
  2. SMTP availability auto-detection
     - tenant SMTP configured → require verification
     - system SMTP configured → require verification
     - no SMTP at all         → skip (graceful fallback)
  3. Global env config (AUTH_REQUIRE_EMAIL_VERIFICATION) — fallback only

Changes:
- Domain: SecuritySettings.EmailVerificationMode + EmailVerificationMode enum
  with "auto" / "always" / "never" values
- DefaultSettings: EmailVerificationMode defaults to "auto"
- SecuritySettings.Validate: rejects invalid modes
- AuthService.shouldRequireEmailVerification: encapsulates the resolution logic
- AuthService.SetSMTPChecker: optional dependency injection
- AuthService.Register: uses smart check, returns RequiresVerification reflecting actual decision
- AuthService.Login: only blocks unverified users when verification is currently required
- EmailService now implements SMTPAvailabilityCheck (HasSystemSMTP, HasTenantSMTP)
- main.go: wires Email service into AuthService for smart detection

Migration: NONE — uses existing JSONB tenants.settings column.
Existing tenants automatically get "auto" behavior (empty mode = auto path).

Production behavior:
- New deployment without SMTP → first user registers + auto-verified + can login
- After admin configures SMTP → next users get verification email as expected
- Tenant admin can force "always" or "never" via tenant settings UI (TODO: UI)
Operators can now edit agent config templates (yaml/env/docker/cli)
without rebuilding the frontend. Templates live in
configs/agent-templates/*.tmpl on the API host.

Backend:
- New configs/agent-templates/{yaml,env,docker,cli}.tmpl files using
  Go text/template syntax with helper funcs (toScannerName, firstTool,
  slugify)
- AgentConfigTemplateService loads templates from filesystem at startup,
  supports Reload() for live editing without restart, falls back to
  built-in defaults if any file is missing
- New GET /api/v1/agents/{id}/config-templates endpoint returns rendered
  templates as JSON
- AgentHandler gains SetTemplateService and SetPublicAPIURL injection
- New AgentConfig config block: AGENT_CONFIG_TEMPLATES_DIR (default
  configs/agent-templates) and AGENT_PUBLIC_API_URL (fallback to APP_URL)
- Dockerfile copies configs/ into the production image so the templates
  ship with the binary
Addresses 7 critical/high findings from the multi-role audit.

SECURITY CRITICAL — API key in URL (credential leak):
- Move agent config templates API key from query parameter to
  X-Agent-API-Key header. Query strings are logged by proxies, load
  balancers, CDNs, browser history, referer headers — embedding a
  credential there is a known leakage vector.
- Backend reads from header now (agent_handler.go:670).
- Frontend sends via header with AbortController for clean unmount.
- Next.js proxy forwards x-agent-api-key in route.ts allowlist.

SECURITY HIGH — Login email verification logic:
- Old: Login passed empty tenantID to shouldRequireEmailVerification, so
  the per-tenant override never applied. Tenant with mode="always" could
  not enforce verification at login.
- New: shouldRequireEmailVerificationForUser walks the user's tenant
  memberships and applies the strictest mode (always > never > auto).
- Falls back to global SMTP detection only if user has no memberships.

SECURITY HIGH — SMTP downgrade protection:
- If a user has a verification token (issued at registration when SMTP
  was required), Login NOW always denies even if SMTP is later removed.
- Prevents an attacker from triggering SMTP removal to auto-verify
  accounts stuck pending verification.

PERFORMANCE HIGH — Missing index on commands.step_run_id:
- CancelByPipelineRunID JOINs commands → step_runs via step_run_id
  but commands had no index on that column. Full table scan on every
  cancel.
- Migration 104 adds idx_commands_step_run_id (partial, NOT NULL).

PERFORMANCE HIGH — Non-sargable retry query:
- Old idx_pipeline_runs_retry_pending was (scan_id, completed_at), but
  the query orders by completed_at and filters on retry_dispatched_at
  IS NULL. The previous index was missing completed_at IS NOT NULL in
  its WHERE clause and had wrong column order.
- Migration 104 drops the old index and creates idx_pipeline_runs_retry_eligible
  on (completed_at ASC, scan_id) WHERE status='failed' AND
  retry_dispatched_at IS NULL AND completed_at IS NOT NULL.
- Also adds idx_pipeline_runs_tenant_status_created for paginated lists.

FEATURE HIGH — EmailVerificationMode in API:
- Add field to UpdateSecuritySettingsRequest, UpdateSecuritySettingsInput,
  SecuritySettingsResponse — operators can now actually configure
  per-tenant email verification mode via the API.
- Validation: oneof=auto always never
- Empty value preserves existing mode (don't accidentally reset to auto).

UX MEDIUM — Disable agent config buttons when content empty:
- Added disabled={!yamlConfig} (etc) so spamming buttons during initial
  render is impossible. Added aria-label for accessibility.
- Collapse GetAggregateStats from 6 round-trips to 1 (CTE + UNION ALL).
  Postgres plans a single scan of the filtered set for all aggregates;
  result is iterated as (category, key, value::float8) rows. ~83% reduction
  in DB traffic for the asset stats endpoint, which is hit on every page
  render of /assets/* (every 30s per tenant via SWR).

- GetAggregateStats now accepts a `tags` filter ([]string, overlap operator)
  so /assets/stats?types=...&tags=... mirrors List() semantics. This lets
  the UI keep stat cards consistent with whatever the user filtered to.

- UpdateAssetInput + UpdateAssetRequest now carry owner_ref so the field
  is editable post-create (previously create-only). Wired through service
  to entity.SetOwnerRef.

- Mock test files updated for the new repo signature.
GET /api/v1/agents/stats returns aggregate counts grouped by status,
health, type, and execution_mode for the calling tenant. Computed in a
single query using a CTE + UNION ALL — no client-side .filter().length
over a paginated list, which previously undercounted as soon as a tenant
had more than one page (per_page default = 20) of agents.

The endpoint also returns active_jobs (SUM(current_jobs) for online
daemon agents) and online_active for the headline cards.

Route is registered before /{id} so chi doesn't claim "stats" as a path
parameter. All six agent repository mocks gain a stub implementation.
…o 502

parseQueryArray
  Adds maxQueryArrayItems (100) and maxQueryArrayItemLen (200) caps so a
  pathological ?tags=a,b,...10000-items query can no longer allocate
  unbounded slices or pq arrays. Defense-in-depth raw input cap before
  splitting. Trim happens after the cap, so length limits are enforced.

integration_handler.handleServiceError
  When the SCM client returns a network error (DNS lookup failed, refused,
  i/o timeout, TLS handshake, EOF, network unreachable), the API used to
  log the raw error and return 500 INTERNAL_ERROR — leaking internal
  hostnames and DNS resolver addresses. Now: log details server-side at
  WARN, return 502 INTEGRATION_UNREACHABLE with a generic message
  ("Could not reach the integration provider..."). Aligns with the
  CLAUDE.md "generic error messages" rule.
…tats filter

Owner endpoints (POST/PUT/DELETE /assets/{id}/owners):
- Add per-request tenant-asset cross-check via new requireAssetInTenant
  helper. Closes IDOR vector where a user with assets:write in tenant A
  could craft a request against an asset_id in tenant B.
- Inject asset.Repository into AssetOwnerHandler.

Asset relationship Create handler:
- Use chi URLParam {id} as authoritative source asset id; reject body
  source_asset_id when it disagrees with the URL.
- Add placement mutex in service: runs_on and deployed_to are mutually
  exclusive for the same source/target pair (UI hides candidates,
  backend enforces for non-UI clients).

Finding stats endpoint:
- Add optional asset_id query param so the severity cards on
  /findings?assetId=… reflect the filtered table instead of global
  tenant counts. Plumbed through repo (4-arg GetStats), service
  (GetFindingStatsInput.AssetID) and handler. All 8 mock implementations
  in tests/ updated to match the new signature.

Pentest campaign members repository:
- Fix 'column u.display_name does not exist' error: rename to u.name
  (same pattern as the earlier fix in access_control_repository).

Compliance module:
- Add ModuleCompliance constant and entry to UserFacingModuleIDs so
  tenant admins can disable it from Settings → Modules.
- ModulePermissionMapping entry now uses the constant instead of the
  bare 'compliance' string for consistency.

New migrations:
- 000105_compliance_module_seed: insert the missing compliance row into
  the modules registry. The compliance feature was implemented end-to-end
  but missing from the seed, so the sidebar entry pointed at a phantom.
- 000106_users_trgm_search_indexes: GIN trigram indexes on users(name)
  and users(email) so the owner picker server-side ILIKE search scales
  past sequential-scan range. CONCURRENTLY removed because golang-migrate
  wraps each file in a transaction; comment documents the manual workaround
  for very large existing deployments.
Relationship types codegen:
- New configs/relationship-types.yaml is the single source of truth
  for every relationship type ID, label, description, and constraint.
- New cmd/gen-relationships Go program reads the YAML and emits two
  generated files: pkg/domain/asset/relationship_types_generated.go
  (Go enum + RelationshipTypeRegistry + categories) and
  ../ui/src/features/assets/types/relationship.types.generated.ts
  (TS union + GENERATED_RELATIONSHIP_LABELS + constraint maps).
- Manual constants in pkg/domain/asset/relationship.go are removed;
  AllRelationshipTypes() now wraps the generated slice.
- New `make generate-relationships` Makefile target.
- Workflow: edit YAML → run target → commit YAML + the two generated
  files. Generated files have a "DO NOT EDIT" header.
- Validation: codegen errors out on duplicate IDs, unknown categories,
  empty constraints, etc.

Migration 000107_relationship_deployed_to_backfill:
- One-shot data fix for relationship rows that were created under the
  old loose constraint table where service/api/website could be a
  deployed_to source. Per CMDB best practice, deployed_to is artifact-
  only — runtime workloads belong in runs_on.
- Step (a): DELETE rows where a runs_on row already exists between the
  same pair (pure duplicates).
- Step (b): UPDATE the remaining bad rows to relationship_type='runs_on',
  preserving description / confidence / impact_weight / tags.
- Marked "review before applying" in the file header. Down migration is
  intentionally a no-op (the data fix is one-way).
ESLint blocks @ts-nocheck via ban-ts-comment. The generated TS file is
type-safe so the directive was unnecessary belt-and-braces. Removed
from the template; pre-commit lint passes.
Honest evaluation of the relationship registry surfaced multiple issues
the previous version inherited from generic CMDB best practice without
validating against the actual platform model. This commit acts on that
evaluation:

Type registry changes (configs/relationship-types.yaml):
- DROP `owned_by` — duplicates the proper asset_owners table (RACI
  ownership lives there, not as a fake "service-as-team" edge).
- DROP `member_of` — the inverse of `contains`. Hierarchy now has a
  single canonical direction (parent → child via `contains`). Existing
  ingest pipeline (subdomain → domain) updated to use the new direction.
- TIGHTEN `resolves_to` to STRICT semantic — targets are now only
  `[ip_address, load_balancer]`. Previously also accepted `host` and
  `cloud_account` which contradicted the description ("DNS resolves to
  IP / CDN / load balancer"). Strict semantic is more flexible long-term:
  it enables cross-relationship analytics ("all domains resolving to a
  compromised IP"), supports external scanner integration, and lets new
  IP-bearing types (subnet, vip, anycast) extend cleanly.
- ADD `cname_of` — DNS alias / subdomain → parent domain edges that
  resolves_to was being misused for.
- ADD `peer_of` — symmetric edge for HA pairs, sibling services,
  leader/follower nodes. Previously had to be modeled as two depends_on
  edges in opposite directions which was wrong.
- ADD `replicates_to` — infra-level state copy (DB primary → replica,
  log shipper → remote). Distinct from sends_data_to which is
  application-level data flow.
- ADD `has_access_to` — the missing IAM resource direction. The old
  `granted_to` conflated "credential granted to principal" with
  "principal has access to resource"; now split.
- TIGHTEN `granted_to` — restricted to credential → principal
  (matches the IAM grant semantic). Resource-side targets moved to
  has_access_to.
- TIGHTEN descriptions for `monitors` vs `protected_by` — the
  distinction is now explicit (passive observability vs active
  prevention) so users have a rule for ambiguous tools (EDR, WAF).
- RENAME category `control_and_ownership` → `control_and_observability`
  (since `owned_by` is gone).

Subdomain ingest fix (internal/app/ingest/processor_assets.go):
- Updated to use the new canonical hierarchy direction
  `parent_domain contains subdomain` instead of the dropped
  `subdomain member_of parent_domain`.

Usage stats endpoint:
- New `RelationshipRepository.CountByType(tenantID)` aggregates
  relationship counts per type via a single GROUP BY query.
- New service method `GetRelationshipTypeUsage(tenantID)` joins the
  per-tenant counts with the registry metadata so admins can see which
  types are unused (count=0) and prune the registry based on real data.
- New endpoint `GET /api/v1/relationships/usage-stats` exposes it.
- Returns every registered type, including zero-count entries, in
  YAML declaration order for stable response ordering.

Tests:
- Removed dropped types from the AllRelationshipTypes test list and
  added the new ones.
- Replaced the hard-coded `expected 16` count with
  `len(asset.AllRelationshipTypes())` so the test stays in sync with
  the registry.
- Added `MockRelationshipRepository.CountByType` implementation.
New comprehensive guide at docs/development/relationship-types.md
covering:
- The 30-second mental model (YAML → codegen → Go + TS)
- Quickstart for adding a new type end to end
- File map showing what's manual vs generated
- Full YAML schema reference with field rules and naming conventions
- Workflows for add / modify / rename labels / remove / category changes
- Validation rules and what each codegen error means
- How the generated files are wired into the rest of the codebase
- Telemetry — using the usage-stats endpoint to drive registry pruning
- Limitations and known footguns (asset type IDs not validated, etc.)
- Troubleshooting common errors
- Design rationale FAQ (why YAML, why AOT codegen, why global registry,
  why some types are missing, etc.)
- Cheat sheet at the bottom

Cross-link added to the YAML file's header comment so anyone editing
the registry sees the doc reference up front.
cmd/gen-relationships/main_test.go:
- 17 tests covering every validate() error path: empty categories,
  empty types, missing/duplicate IDs, unknown category references,
  empty labels, missing constraints, empty sources/targets in a
  constraint.
- Tests for the template helpers (toCamel, goStringList, tsStringList,
  trimDesc) including the YAML-folded multiline collapse path.
- TestProductionYAML_LoadsAndValidates pins the production YAML
  through the parser + validator, so a future edit that produces an
  invalid config fails the test before it can be merged. Catches
  drift before it hits the build pipeline.

Migration 000108_asset_relationships_type_index:
- Adds idx_asset_relationships_tenant_type covering (tenant_id,
  relationship_type). The /relationships/usage-stats endpoint runs
  WHERE tenant_id = $1 GROUP BY relationship_type — the existing
  unique constraint covers the WHERE but is wider than needed,
  forcing more page reads. The narrower covering index lets PG do
  an Index Only Scan + HashAggregate, significantly faster on tenants
  with hundreds of thousands of relationships.
- Trivial cost (~16 bytes per row) for meaningful read perf.
POST /api/v1/assets/{id}/relationships/batch — bulk-creates many
relationships from one source asset in a single round trip. Replaces
the previous N-parallel-POST flow used by the multi-select dialog.

Service (CreateRelationshipBatch in asset_relationship_service.go):
- Validates the tenant + source asset ONCE for the whole batch (the
  primary efficiency win — single Get-By-ID instead of N).
- Per-item: parses target ID, parses + validates relationship type,
  fetches target asset, runs the placement mutex check, builds the
  domain entity, persists.
- Per-item failures (invalid type, missing target, duplicate, mutex
  violation) mark the item as error/duplicate and continue. The
  whole batch never aborts because one item failed — every item
  gets a slot in the result.
- Maps unique-violation errors to BatchCreateStatusDuplicate so the
  UI can show "already related" instead of "internal error".
- The result includes per-item index (matching the request position
  so the frontend can map back to target names without re-fetching),
  per-item status, and aggregate created/duplicates/errors counts.

Handler (BatchCreate in asset_relationship_handler.go):
- New BatchCreateRelationshipRequest with `items` array (validator
  caps at 100 entries to prevent abuse via giant payloads).
- Always returns 200 with the result body — per-item failures are
  inside, not at the HTTP layer.
- Whole-batch failures (bad tenant, missing source asset) still
  return the appropriate error response via handleServiceError.

Route registration: POST /api/v1/assets/{id}/relationships/batch
gated by permission.AssetsWrite, same tenant middleware chain as
the singleton create endpoint.

Tests (asset_relationship_service_test.go):
- 7 cases for CreateRelationshipBatch: happy path all-succeed, mixed
  one-invalid-target, invalid-type-marks-as-error, missing source
  asset, malformed tenant ID, empty input, and a contract test that
  pins the result Index matching input position (the frontend depends
  on this).
- 4 cases for GetRelationshipTypeUsage: returns every registered type
  with count (including unused types as count=0), includes metadata
  from the registry, validates tenant ID format, empty tenant returns
  all-zero entries.

11 new tests, all passing.
…n member removal

Session / refresh_token table growth:
- The CleanupExpiredSessions() method existed in session_service.go
  (DELETE FROM sessions WHERE expires_at < NOW() OR status IN
  ('expired', 'revoked') + DELETE FROM refresh_tokens WHERE
  expires_at < NOW()) but was NEVER wired into any background worker.
  Every login created rows, every logout marked them 'revoked' (not
  deleted), every refresh-token rotation added new rows with old ones
  marked as 'used'. All accumulated forever.
- New SessionCleanupTicker in workers.go: runs every 1 hour, calls
  CleanupExpiredSessions. Also fires once at startup to clear any
  backlog from when this worker didn't exist. Timeout 5 minutes per
  run. Logs counts when rows are actually deleted.
- Same pattern as the existing NotificationCleanupTicker (24h daily)
  but on a tighter schedule because sessions expire faster than
  notifications' 90-day retention.

Invitation re-accept after member removal (privilege escalation fix):
- When an admin removes a user from a tenant, any PENDING (unaccepted)
  invitations for that user's email in that tenant are now auto-deleted.
  Without this, a removed user could click their original invitation
  link and rejoin the tenant, bypassing the admin's revocation intent.
- New repo method: DeletePendingInvitationsByUserID(tenantID, userID).
  Uses a JOIN with the users table so the service doesn't need to
  fetch the user's email separately — single SQL statement, case-
  insensitive email match.
- Wired into TenantService.RemoveMember() as best-effort: failure is
  logged but doesn't roll back the membership delete (the primary
  security goal — revoking access — has already succeeded).
- Already-accepted invitations are NOT deleted (they're a historical
  audit record with no active token).
- Updated 4 test mock files to satisfy the new interface method.
POST /api/v1/tenants/{tenant}/invitations/{invitationId}/resend — admin-only
endpoint that re-sends the invitation email for a pending invitation
without changing the token, expiry, or any other metadata. Idempotent:
calling it 3 times sends 3 emails, all containing the same token link.

Used when the original invitation email was lost to spam filters or the
recipient's email changed. The previous workaround was delete + recreate,
which invalidated the old token — if the user had already clicked the
link once but didn't finish signing up, their link would be dead.

Service (ResendInvitation in tenant_service.go):
- Validates tenant isolation, checks invitation is pending + not expired
- Looks up inviter name + tenant name for the email template (same logic
  as CreateInvitation)
- Re-enqueues via emailEnqueuer.EnqueueTeamInvitation with the remaining
  TTL (not the original 7 days) so the email reflects the actual expiry
- Audit logged as ActionInvitationCreated with "resent" message

Handler (ResendInvitation in tenant_handler.go):
- Requires RequireTeamAdmin() middleware
- Returns 200 with JSON message on success
- Returns 400 if invitation is already accepted or expired
- Returns 404 if invitation doesn't exist or belongs to another tenant

Route: POST /tenants/{tenant}/invitations/{invitationId}/resend
registered before the DELETE /{invitationId} route so chi matches the
literal /resend path before the param pattern.
Three-state membership lifecycle: active → suspended → reactivate / remove

Migration 000109_tenant_member_suspension:
- Adds status VARCHAR(20) DEFAULT 'active', suspended_at, suspended_by
  to tenant_members. Existing rows default to 'active' (no data change).
- CHECK constraint: status IN ('active', 'suspended')
- Partial index on (tenant_id) WHERE status='suspended' for admin queries.

Domain (pkg/domain/tenant/membership.go):
- New MemberStatus type (active | suspended)
- Suspend(by) — sets status='suspended', records who and when. Rejects
  owner suspensions.
- Reactivate() — sets status='active', clears suspended_at/suspended_by.
- IsSuspended(), Status(), SuspendedAt(), SuspendedBy() getters.
- ReconstituteMembershipWithStatus() for loading from DB.

Service (internal/app/tenant_service.go):
- SuspendMember(membershipID, actx) — calls domain Suspend, persists,
  invalidates permissions + sessions, wipes pending invitations (same
  cleanup as RemoveMember). Audit logged with severity=high.
- ReactivateMember(membershipID, actx) — calls domain Reactivate,
  persists. Permissions load lazily on next request. Audit logged.

Repository (internal/infra/postgres/tenant_repository.go):
- UpdateMembershipStatus(membership) — single UPDATE touching status,
  suspended_at, suspended_by columns.

Handler (internal/infra/http/handler/tenant_handler.go):
- POST /tenants/{t}/members/{id}/suspend → SuspendMember
- POST /tenants/{t}/members/{id}/reactivate → ReactivateMember
- Both require RequireTeamAdmin() middleware.

Routes (internal/infra/http/routes/tenant.go):
- Registered before DELETE /members/{memberId} so chi matches literal
  /suspend and /reactivate paths before the param pattern.

Tests: 4 mock files updated with UpdateMembershipStatus stub.
@0xmanhnv 0xmanhnv changed the title Develop Develop: v0.1.5 Apr 9, 2026
Comment thread internal/infra/postgres/asset_repository.go Fixed
Comment thread internal/infra/postgres/asset_repository.go Fixed
@0xmanhnv 0xmanhnv merged commit b78800c into main Apr 9, 2026
27 checks passed
0xmanhnv added 2 commits April 9, 2026 13:08
- Filter out suspended memberships from GetUserMemberships at the SQL
  level so they cannot be used to mint tenant-scoped JWTs at login,
  refresh, or token-exchange.
- Load membership status in GetMembership / GetMembershipByID and reject
  suspended members from RequireMembership middleware on every
  tenant-scoped request. This closes the gap where a suspended user's
  still-valid JWT could keep working until expiry.
- Add unit tests covering the active / suspended / not-a-member paths.
The member suspension feature shipped with a backend gap: ListMembers /
SearchMembers / GetMemberByEmail / GetMemberStats all selected u.status
(the user-level status field) instead of m.status (the membership-level
field added in migration 109). After an admin clicked Suspend, the API
correctly wrote tenant_members.status='suspended' and middleware
correctly rejected the suspended user — but the response payload still
reported status='active', so the UI badge, the "Suspended" filter tab,
the per-row Suspend/Reactivate toggle, and the "Active members" stats
card all silently lied to the operator.

This commit switches all four queries to COALESCE(m.status, 'active'),
making the membership lifecycle the single source of truth for "is this
person currently allowed in this tenant". User-level status is left in
place for the login-time check (defense-in-depth against direct DB
manipulation) but is no longer surfaced to tenant admins.

Also drop a stray ineffectual idx++ in asset_repository.GetAggregateStats
that the linter flagged — it was the trailing increment after the last
filter branch and added no value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants