docs: customer-facing logs design (AI Edge v1) by mattdjenkinson · Pull Request #72 · milo-os/telemetry

mattdjenkinson · 2026-05-13T20:16:00Z

Summary

Proposes a v1 customer-facing logs pipeline for the Datum platform, scoped to AI Edge (HTTPProxy access logs and WAF events) as the first producer. Doc lives at docs/architecture/customer-facing-logs.md.

What's in the design

- Service declaration: ServiceConfiguration.spec.logs[] fans out to a new telemetry.LogDefinition CRD; spec.monitoredResourceTypes[] additionally fans out to a new telemetry.MonitoredResourceType CRD alongside the existing billing fan-out.
- Ingestion: AI Edge data plane emits OTLP to a regional OTel Collector gateway that stamps cloud.account.id, validates resource attributes against the declared label vocabulary, derives tenant_id, and writes to ClickHouse.
- Storage: Shared platform_logs table, tenant_id first in ORDER BY and partition key, log_id and resource_type promoted to top-level columns.
- Query: Loki-compatible HTTP API at /projects/{project}/telemetry/loki/api/v1/.... URL is authoritative for tenancy; X-Scope-OrgID ignored. Labels/series served from the MonitoredResourceType catalog so Grafana works on empty projects.
- Retention: 7d default for allLogs, 400d for audit; per-category overrides applied via a TTL column at write time.
- Quota: New LogIngestionQuota resource on (project, category_group); 429 + drop counter on exceed, no silent drops.
- Default enablement: allLogs opt-in via LogCollectionPolicy; audit always-on.
- Redaction: Attribute-level only in v1; platform allowlist + customer-configurable LogRedactionPolicy. Body is not redacted.

Scope

v1 producer is AI Edge only: two LogDefinitions on networking.datumapis.com/HTTPProxy — access log (allLogs) and WAF events (allLogs + audit). Control-plane audit logs (covered by milo-os/activity) integrate later via the shared catalog. LogSource in ExportPolicy is deferred.

Open questions

Captured explicitly at the end of the doc: live tail backend (poll vs. Kafka), LogCollectionPolicy granularity, the LogQL subset to support in v1, and how catalog-backed label discovery handles tenant-specific label values.

Add a design document for the v1 customer-facing logs pipeline, scoped to AI Edge (HTTPProxy access logs and WAF events) as the first producer.

Key decisions captured:

- Service declaration via ServiceConfiguration.spec.logs[] fanning out to telemetry.LogDefinition and telemetry.MonitoredResourceType CRDs
- OTel Collector gateway stamps tenant identity and validates the label vocabulary declared in MonitoredResourceType
- Shared ClickHouse platform_logs table, tenant_id as the first ORDER BY column and partition key
- Loki-compatible HTTP API exposed under /projects/{project}/telemetry/loki/api/v1/... with URL-based tenancy
- Catalog-backed labels/series discovery so Grafana populates the stream selector UI on empty projects
- Tiered retention defaults (7d allLogs, 400d audit), opt-in collection for non-audit categories, attribute-level redaction only

Open questions are listed explicitly: live tail backend, policy granularity, the LogQL subset to support in v1, and how label-value discovery handles tenant-specific values.

mattdjenkinson · 2026-05-14T09:14:43Z

@bmertens-datum can't add you as a reviewer but tagging you here so you can take a look.

Add a C4 container diagram of the customer-facing logs ingestion pipeline and reference it from the design doc. Include a docs Taskfile mirroring milo-os/activity so future PlantUML sources render to PNG via the same docker plantuml workflow.

mattdjenkinson · 2026-05-18T09:57:18Z

@scotwells @kevwilliams @savme would be good to get your thoughts here when you get some time. It would be nice to have something working for AI Edge over the next couple of weeks.

scotwells

Good first draft!

scotwells · 2026-05-18T21:00:04Z

+2. Stamp `cloud.account.id` (Milo project ID) immutably from the caller's
+   workload identity — customers cannot override.


We won't be able to rely on workload identity to resolve the project ID, because the source of logs will typically be a service component (e.g. Envoy) writing to the log sink rather than a consumer's application.

Instead, we just need to ensure that all logs flowing into the gateway carry tenancy labels:

- tenant.kind: the type of tenant that generated the log (e.g. Project, Organization, User)
- tenant.name: the resource name of the tenant (e.g. personal-project-xyz)

Good catch. Updated the design so tenancy travels on the record itself: every log entering the gateway must carry tenant.kind and tenant.name, the gateway resolves tenant_id from those via the project catalog, and records without them are rejected. Producing services own stamping both tenancy and resource-identity labels; the gateway just validates and resolves.
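
To make the resolve-or-reject behaviour above concrete, here is a minimal sketch of the check the gateway might run per record. It is not taken from the design doc: `CATALOG`, `RejectedRecord`, and `resolve_tenant_id` are hypothetical names, the catalog is a plain dict standing in for the project catalog, and rejecting records whose tenant is unknown to the catalog is an assumption (the thread only states that records missing the labels are rejected).

```python
class RejectedRecord(Exception):
    """Raised when a log record cannot be attributed to a tenant."""


# Hypothetical stand-in for the project catalog:
# (tenant.kind, tenant.name) -> tenant_id. The ID value is made up.
CATALOG = {("Project", "personal-project-xyz"): 1001}


def resolve_tenant_id(attributes, catalog=CATALOG):
    """Return the tenant_id for a record, or raise RejectedRecord."""
    kind = attributes.get("tenant.kind")
    name = attributes.get("tenant.name")
    # Records without both tenancy labels are rejected, as described above.
    if not kind or not name:
        raise RejectedRecord("record is missing tenant.kind or tenant.name")
    try:
        return catalog[(kind, name)]
    except KeyError:
        # Assumption: an unknown tenant is also a rejection, not a default.
        raise RejectedRecord(f"unknown tenant {kind}/{name}") from None
```

The gateway never invents tenancy in this sketch; it only looks up what the producing service stamped.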

scotwells · 2026-05-18T21:02:52Z

+   `resource_type` and validate that emitted resource attributes are a
+   subset of the declared label vocabulary. Reject undeclared labels.
+4. Derive `tenant_id` from `cloud.account.id`.
+5. Write to ClickHouse via the `clickhouse` exporter.


We should put NATS between the gateway and ClickHouse so we can backpressure to NATS if ClickHouse is down. This also lets us use NATS for real-time log streaming for live tailing.

Agreed. Added a NATS JetStream subject between the gateway and ClickHouse, with a separate writer consumer draining into platform_logs. Two benefits called out in the doc: backpressure when ClickHouse is unhealthy (the gateway no longer has to drop), and the same stream feeding the Loki /tail handler so live tail doesn't have to poll ClickHouse. Removed the prior "ClickHouse polling vs. Kafka" open question; NATS is the answer.
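
The decoupling described above can be illustrated with a small in-memory model. This is only a sketch of the shape of the flow: the `Stream` class is a hypothetical stand-in and does not use or imitate the NATS client API, and durability, acknowledgements, and retries are all left out.

```python
from collections import deque


class Stream:
    """In-memory stand-in for the durable stream (NOT the NATS API)."""

    def __init__(self):
        self._pending = deque()      # records not yet written to storage
        self._tail_subscribers = []  # callbacks used for live tail

    def subscribe_tail(self, callback):
        self._tail_subscribers.append(callback)

    def publish(self, record):
        # Gateway side: append and return. Storage health does not matter here.
        self._pending.append(record)
        for callback in self._tail_subscribers:
            callback(record)

    def drain(self, write, healthy):
        """Writer side: move pending records to storage while it is healthy."""
        written = 0
        while self._pending and healthy():
            write(self._pending.popleft())
            written += 1
        return written
```

While `healthy()` returns False nothing is dropped: records stay pending, and tail subscribers have already seen them.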

scotwells · 2026-05-18T21:03:35Z

+3. Look up the declared `MonitoredResourceType` for the entry's
+   `resource_type` and validate that emitted resource attributes are a
+   subset of the declared label vocabulary. Reject undeclared labels.
+4. Derive `tenant_id` from `cloud.account.id`.


We should leave this to be the tenant kind / name information.

Done, the gateway now resolves tenant_id from (tenant.kind, tenant.name) via the project catalog instead of from cloud.account.id. cloud.account.id stamping is removed from the design.

scotwells · 2026-05-18T21:07:06Z

+Services are responsible for stamping the instance-identifying labels
+(e.g. `resource.name`, `resource.namespace`, `hostname`). The gateway
+enforces the vocabulary; it does not inject instance identity.


This should include resource.kind and resource.group.

Added. The MonitoredResourceType wording for HTTPProxy now includes resource.group and resource.kind alongside resource.name / resource.namespace, and the catalog example is updated to match. Services are expected to stamp all four.
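
As a rough illustration of the vocabulary check, the sketch below validates a record's resource attributes against a declared label set. `DECLARED_LABELS` and `validate_resource_labels` are hypothetical; the label set contains only the four identity labels named in this thread, and treating all four as required is an assumption based on "services are expected to stamp all four".

```python
# Hypothetical catalog entry, limited to the four identity labels named above.
DECLARED_LABELS = {
    "networking.datumapis.com/HTTPProxy": {
        "resource.group",
        "resource.kind",
        "resource.name",
        "resource.namespace",
    }
}


def validate_resource_labels(resource_type, resource_attributes, declared=DECLARED_LABELS):
    """Return a list of problems; an empty list means the record is accepted."""
    vocabulary = declared.get(resource_type)
    if vocabulary is None:
        return [f"undeclared resource type: {resource_type}"]
    problems = []
    # Emitted attributes must be a subset of the declared vocabulary.
    undeclared = sorted(set(resource_attributes) - vocabulary)
    if undeclared:
        problems.append("undeclared labels: " + ", ".join(undeclared))
    # Assumption: every declared identity label must be present.
    missing = sorted(vocabulary - set(resource_attributes))
    if missing:
        problems.append("missing identity labels: " + ", ".join(missing))
    return problems
```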

scotwells · 2026-05-18T21:09:25Z

+```sql
+CREATE TABLE platform_logs (
+    tenant_id           UInt32,
+    timestamp           UInt64,
+    observed_timestamp  UInt64,
+    severity_number     UInt8,
+    severity_text       LowCardinality(String),
+    body                String,
+    log_id              LowCardinality(String),
+    resource_type       LowCardinality(String),
+    attributes_string   Map(String, String),
+    resources_string    Map(String, String),
+    trace_id            String,
+    span_id             String
+)
+ENGINE = MergeTree()
+PARTITION BY (tenant_id, toYYYYMM(toDateTime(timestamp / 1e9)))
+ORDER BY (tenant_id, log_id, timestamp)
+TTL toDateTime(timestamp / 1e9) + INTERVAL 7 DAY DELETE;
+```


We'll want fields for the resource identity information so we can filter appropriately (group, kind, name, namespace). Will also want to add filters where appropriate. The most common querying format will probably be querying for a specific resource's logs followed by querying tenant level information.

e.g. "give me all access logs for proxy XYZ" and "give me all logs for project X".

Promoted resource_group, resource_kind, resource_name, resource_namespace to top-level columns alongside log_id / resource_type, and reworked the sort key to (tenant_id, resource_type, resource_name, log_id, timestamp) so both "logs for proxy XYZ" and "logs for project X" hit the primary index. Also added a consumer_name column populated only on producer-destination rows, so the service-side query "give me logs for consumer X" is a top-level column filter rather than an attribute-map lookup. Called out the two query shapes explicitly under the schema.
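
One way to reason about the reworked sort key is to count how many of its leading columns a query pins with equality filters. The helper below is only a rough proxy for how far the primary index can narrow a scan; `leading_key_columns` is a hypothetical name and the function does not model ClickHouse's actual index selection.

```python
# Sort key as described in the reply above.
SORT_KEY = ("tenant_id", "resource_type", "resource_name", "log_id", "timestamp")


def leading_key_columns(equality_filters, sort_key=SORT_KEY):
    """Return the leading sort-key columns that the query pins by equality."""
    prefix = []
    for column in sort_key:
        if column not in equality_filters:
            break  # a gap ends the usable prefix
        prefix.append(column)
    return prefix


# "Logs for proxy XYZ": tenant, resource type, and resource name are pinned.
per_resource = leading_key_columns({"tenant_id", "resource_type", "resource_name"})

# "Logs for project X": only the tenant is pinned.
per_tenant = leading_key_columns({"tenant_id"})
```

Both query shapes start at the first key column, which is the property the reordering is meant to give.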

scotwells · 2026-05-18T21:13:44Z

+### Ingestion Quota
+
+A new `telemetry.miloapis.com/LogIngestionQuota` resource integrates with
+the standard Milo quota system. Quota is dimensioned by
+`(project, category_group)` in bytes/second. On exceed:
+
+- Gateway returns 429 with `Retry-After`.
+- A per-tenant `telemetry_ingestion_dropped_bytes_total` counter is
+  exposed via the same Loki API so customers can see drops.
+- No silent drops.


We can remove quota for now. That'll be a follow-on enhancement.

Removed the LogIngestionQuota resource and the whole quota section. Listed it as a follow-on in Non-Goals and dropped it from the v1 delivery slice.

scotwells · 2026-05-18T21:13:59Z

+### Retention
+
+Fixed tiered defaults; no free-form per-project retention in v1.
+
+| Category Group | Default Retention | Disable-able |
+|---|---|---|
+| `allLogs`      | 7 days    | Yes (opt-in collection) |
+| `audit`        | 400 days  | No (compliance signal)  |
+
+Paid retention overrides are applied per category group on a project, not
+per log ID. Implemented as a TTL adjustment column populated by the
+gateway at write time so existing rows are not rewritten when overrides
+change.


Retention won't be user-controllable.

Adjusted. Retention is now a flat platform-set default (7d for allLogs), implemented as the table TTL on timestamp. Removed the per-tenant TTL column and the paid-override mechanism; flagged per-project / per-category overrides as a follow-on enhancement.
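
For reference, the flat 7-day TTL on `timestamp` works out as follows when timestamps are stored as nanoseconds since the epoch, as in the quoted schema. This sketch only mirrors the arithmetic of the TTL expression; `expires_at` and `is_expired` are hypothetical helpers, and ClickHouse removes expired rows during background merges, so actual deletion can lag the computed expiry.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # flat platform default for allLogs


def expires_at(timestamp_ns, retention=RETENTION):
    """Mirror toDateTime(timestamp / 1e9) + INTERVAL 7 DAY (second precision)."""
    seconds = timestamp_ns // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + retention


def is_expired(timestamp_ns, now, retention=RETENTION):
    """True once the row is eligible for TTL deletion."""
    return now >= expires_at(timestamp_ns, retention)
```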

scotwells · 2026-05-18T21:14:26Z

+- `LogDefinition.spec.categoryGroups` provides a secondary access
+  dimension: `audit` requires a distinct permission from `allLogs`,
+  matching GCP's `roles/logging.viewAccessor` pattern scoped to a log
+  view. The query layer filters out log IDs the caller cannot access
+  before executing the SQL.


Won't need this since this system won't handle audit logs.

Removed. The Access Control section is now just "standard Kubernetes RBAC on the project's telemetry endpoint": no audit vs. allLogs permission split and no separate access model, since the URL is the project control-plane endpoint.

scotwells · 2026-05-18T21:15:28Z

+GET  /projects/{project}/telemetry/loki/api/v1/query
+GET  /projects/{project}/telemetry/loki/api/v1/query_range
+GET  /projects/{project}/telemetry/loki/api/v1/labels
+GET  /projects/{project}/telemetry/loki/api/v1/label/{name}/values
+GET  /projects/{project}/telemetry/loki/api/v1/series
+GET  /projects/{project}/telemetry/loki/api/v1/tail


Maybe we should change this to GET {project-control-plane-endpoint}/telemetry/*? The real URL used by milo for projects is pretty long compared to what this is showing.

Agreed, the prior /projects/{project}/telemetry/... shape didn't match what Milo actually issues. Switched to {project-control-plane-endpoint}/telemetry/loki/api/v1/..., the same per-project control-plane URL Milo gives out for Kubernetes API access, with the telemetry handler mounted at /telemetry/.... No {project} placeholder in the path and no X-Scope-OrgID header; the project is resolved from the endpoint itself. Grafana base URL is just the endpoint + /telemetry/.
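
A small sketch of how a client might build request URLs under this scheme. The endpoint value is a placeholder (the real per-project URL is whatever Milo issues), and `loki_url` is a hypothetical helper; the `query` and `limit` parameters are standard Loki query parameters.

```python
from urllib.parse import urlencode


def loki_url(endpoint, path, **params):
    """Join the project control-plane endpoint with a Loki API path."""
    base = endpoint.rstrip("/") + "/telemetry/loki/api/v1/" + path.lstrip("/")
    return base + ("?" + urlencode(params) if params else "")


# Placeholder endpoint; the project is implied by the endpoint, not the path.
ENDPOINT = "https://project-control-plane.example.invalid"

labels_url = loki_url(ENDPOINT, "labels")
query_url = loki_url(
    ENDPOINT,
    "query_range",
    query='{log_id="networking.datumapis.com/httpproxy-access"}',
    limit=100,
)
```

No tenant header is set anywhere in this sketch, which matches the reply above: tenancy comes from the endpoint alone.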

scotwells · 2026-05-18T21:22:12Z

+  destinations:
+    - audience: tenant
+    - audience: platform


I don't see anything that covers how this is used. I don't think this is a blocker to figure out right now since we can have all logs flowing through this pipeline be consumer facing.

Coming back to this — I think the right direction here is to follow GCP's consumer/producer pattern. Since Milo projects exist for both tenants and service producers, the routing falls out naturally from the project hierarchy. The proposal is to rename audience to type and use consumer/producer to match GCP's terminology:

    - logID: networking.datumapis.com/httpproxy-access
      monitoredResourceType: networking.datumapis.com/HTTPProxy
      destinations:
        - type: consumer   # written to the customer's project
        - type: producer   # written to the networking service's project

A log entry gets written once per destination. The customer's copy lands in their project; the service team's copy lands in the networking service's project, with the originating customer preserved as a consumer_name label for filtering.

In practice this means a customer like ecommerce-co can query their own project and see only their data:

{log_id="networking.datumapis.com/httpproxy-access", resource_name="api-gateway"} | json | http_response_status_code >= 500

While the Datum networking team can query the networking service's project and see across all customers — or drill into a specific one when investigating a support ticket:

    # All customers with elevated error rates
    sum by (consumer_name) (
      rate({log_id="networking.datumapis.com/httpproxy-access"} | json | http_response_status_code >= 500 [5m])
    )

    # Drill into a specific customer
    {log_id="networking.datumapis.com/httpproxy-access", consumer_name="ecommerce-co"} | json | http_response_status_code >= 500

No special cross-tenant access grants needed — the networking team just needs IAM on their own service project.

It also opens up producer-only log types that customers never see:

    - logID: networking.datumapis.com/waf-engine-internal
      destinations:
        - type: producer   # internal only, never visible to customers

Adopted this end-to-end. Renamed audience → type on destinations[], and the values are now consumer / producer matching GCP. The gateway emits one record per declared destination: consumer lands in the originating tenant's project, producer lands in the service's producer project with consumer_name set to the originator. consumer_name is a top-level column on platform_logs so the cross-consumer queries you sketched work without map lookups. Added example LogQL for both the consumer-side ("logs for my proxy") and producer-side ("error rate by consumer" + "drill into consumer X") shapes. Also called out producer-only logs (no consumer destination) as a supported case for internal diagnostics that should never be visible to customers.

No cross-tenant grants needed on either side — each principal has IAM on whichever project's endpoint they're querying.
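
The one-record-per-destination behaviour can be sketched as below. This is illustrative only: `fan_out` is a hypothetical function, the `project` field is a stand-in for whatever the gateway uses to address the destination tenant, and "networking-service" is a made-up name for the producer project.

```python
def fan_out(record, destinations, consumer_project, producer_project):
    """Emit one copy of the record per declared destination."""
    copies = []
    for destination in destinations:
        copy = dict(record)
        if destination["type"] == "consumer":
            # Customer's copy lands in the originating tenant's project.
            copy["project"] = consumer_project
        elif destination["type"] == "producer":
            # Service team's copy keeps the originator as consumer_name.
            copy["project"] = producer_project
            copy["consumer_name"] = consumer_project
        else:
            raise ValueError(f"unknown destination type: {destination['type']}")
        copies.append(copy)
    return copies


entry = {"log_id": "networking.datumapis.com/httpproxy-access", "body": "GET /"}
both = fan_out(
    entry,
    [{"type": "consumer"}, {"type": "producer"}],
    consumer_project="ecommerce-co",
    producer_project="networking-service",
)
```

A producer-only log type is the same call with a single `{"type": "producer"}` destination, so no copy ever reaches the customer's project.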

Adjust the customer-facing logs design based on PR #72 review.

Key changes:

- Tenancy travels on the log record as `tenant.kind` / `tenant.name` instead of being stamped by the gateway from workload identity, since log producers are typically service components (e.g. Envoy) writing to a sink rather than consumer-authored applications
- Replace the `audience: tenant|platform` destination model with `type: consumer|producer` matching GCP. The gateway emits one record per destination; producer rows carry `consumer_name` so service teams can query their own producer project across all consumers without cross-tenant grants
- Add NATS JetStream between the gateway and ClickHouse for backpressure and to back the Loki `/tail` handler without polling ClickHouse
- Promote `resource_group`, `resource_kind`, `resource_name`, `resource_namespace`, and `consumer_name` to top-level columns and reorder the sort key to serve per-resource and per-tenant queries
- Expose the Loki API under the project control-plane endpoint (`{project-control-plane-endpoint}/telemetry/...`) instead of a `/projects/{project}/...` path
- Remove audit logs from scope; they're handled by `milo-os/activity`
- Drop `LogIngestionQuota` and user-controllable retention from v1; both move to follow-on enhancements

Regenerate ingestion-pipeline diagram to match.

Extend the AI Edge access and WAF log entry schemas to support a per-request lifecycle view (firewall decision, geo/routing) keyed off a shared correlation ID.

Key changes:

- Add http.request.id (Envoy x-request-id) to both httpproxy-access and httpproxy-waf so a single request_id filter returns the access entry and every WAF rule that fired on the request
- Denormalise waf.outcome and waf.matched_rules onto the access log so "show me blocked requests" is a single stream filter, not a join; per-rule detail stays on the WAF log for drill-down
- Add user_agent.original to the access log
- Add edge.pop.ingress (and edge.pop.upstream on the access log) to carry the ingress / upstream PoP per request. PoP is emission context, not resource identity, so it lives on the entry schema not the MonitoredResourceType
- Add a Request Correlation subsection explaining the join key and the denormalised-summary pattern

Remove edge.pop.ingress and client.address from the WAF entry schema. Both are already on the paired access log and reachable by joining on http.request.id; duplicating them on every matched-rule row is waste, and the WAF has no concept of edge.pop.upstream anyway since it runs before the routing decision. Clarify in the Request Correlation subsection that the WAF schema is deliberately lean — any context that exists on the access log is reached via the join rather than copied per matched rule.
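
The join described in these two commits can be sketched on in-memory entries. Field names other than `http.request.id`, `waf.outcome`, `client.address`, and `edge.pop.ingress` are hypothetical (in particular `waf.rule.id`), as are the sample values and the `correlate` helper; a real query would run against the log store rather than Python lists.

```python
from collections import defaultdict


def correlate(access_entries, waf_entries):
    """Attach every WAF entry to its access entry via http.request.id."""
    waf_by_request = defaultdict(list)
    for entry in waf_entries:
        waf_by_request[entry["http.request.id"]].append(entry)
    return [
        {"access": access, "waf": waf_by_request.get(access["http.request.id"], [])}
        for access in access_entries
    ]


access_log = [
    # Context such as client address and PoP lives only on the access entry.
    {"http.request.id": "req-1", "waf.outcome": "blocked",
     "client.address": "203.0.113.7", "edge.pop.ingress": "pop-a"},
    {"http.request.id": "req-2", "waf.outcome": "allowed",
     "client.address": "203.0.113.8", "edge.pop.ingress": "pop-b"},
]
waf_log = [
    # Lean WAF rows: one per matched rule, keyed by the shared request ID.
    {"http.request.id": "req-1", "waf.rule.id": "rule-100"},
    {"http.request.id": "req-1", "waf.rule.id": "rule-200"},
]

lifecycle = correlate(access_log, waf_log)
```

Filtering `access_log` on `waf.outcome` alone answers "show me blocked requests"; the join is only needed for per-rule detail.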

mattdjenkinson · 2026-05-19T11:07:52Z

@scotwells thanks for your feedback, have left notes on your comments and updated the doc. I've also added some bits in about using the request.id to join some entries.

mattdjenkinson requested review from drewr, kevwilliams, savme and scotwells May 13, 2026 20:20

mattdjenkinson mentioned this pull request May 13, 2026

Customer-facing logs for Datum Cloud products (AI Edge, Compute, …) datum-cloud/enhancements#714

Open

docs: add ingestion pipeline diagram

50bd415

scotwells requested changes May 18, 2026

mattdjenkinson added 3 commits May 19, 2026 11:21

scotwells mentioned this pull request May 21, 2026

Observability design for the compute service datum-cloud/compute#99

Open


2 participants
