docs: customer-facing logs design (AI Edge v1) #72
Add a design document for the v1 customer-facing logs pipeline, scoped
to AI Edge (HTTPProxy access logs and WAF events) as the first
producer.
Key decisions captured:
- Service declaration via ServiceConfiguration.spec.logs[] fanning out
to telemetry.LogDefinition and telemetry.MonitoredResourceType CRDs
- OTel Collector gateway stamps tenant identity and validates the
label vocabulary declared in MonitoredResourceType
- Shared ClickHouse platform_logs table, tenant_id as the first
ORDER BY column and partition key
- Loki-compatible HTTP API exposed under
/projects/{project}/telemetry/loki/api/v1/... with URL-based tenancy
- Catalog-backed labels/series discovery so Grafana populates the
stream selector UI on empty projects
- Tiered retention defaults (7d allLogs, 400d audit), opt-in
collection for non-audit categories, attribute-level redaction only
Open questions are listed explicitly: live tail backend, policy
granularity, the LogQL subset to support in v1, and how label-value
discovery handles tenant-specific values.
@bmertens-datum can't add you as a reviewer but tagging you here so you can take a look.
Add a C4 container diagram of the customer-facing logs ingestion pipeline and reference it from the design doc. Include a docs Taskfile mirroring milo-os/activity so future PlantUML sources render to PNG via the same docker plantuml workflow.
@scotwells @kevwilliams @savme would be good to get your thoughts here when you get some time. It would be nice to have something working for AI Edge over the next couple of weeks.
| 2. Stamp `cloud.account.id` (Milo project ID) immutably from the caller's | ||
| workload identity — customers cannot override. |
We won't be able to rely on workload identity for resolving the project ID because the source of logs will typically be something a service's component (e.g. Envoy) is outputting to the log sink and not a consumer's application.
Instead, we just need to ensure that all logs flowing into the gateway have tenancy labels on them:
- `tenant.kind` - The type of tenant that generated the log (e.g. Project, Organization, User)
- `tenant.name` - The resource name of the tenant (e.g. `personal-project-xyz`)
Good catch. Updated the design so tenancy travels on the record itself: every log entering the gateway must carry tenant.kind and tenant.name, the gateway resolves tenant_id from those via the project catalog, and records without them are rejected. Producing services own stamping both tenancy and resource-identity labels; the gateway just validates and resolves.
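A minimal sketch of the validate-and-resolve step described in this reply, assuming the project catalog can be modelled as a lookup keyed by `(tenant.kind, tenant.name)`. The `CATALOG` contents, the `RejectedRecord` exception, and the `resolve_tenant_id` name are hypothetical and only illustrate the rule that records without tenancy labels are rejected.

```python
class RejectedRecord(Exception):
    """Raised when a record cannot be attributed to a tenant."""


# Hypothetical stand-in for the project catalog:
# (tenant.kind, tenant.name) -> tenant_id. The ID value is made up.
CATALOG = {("Project", "personal-project-xyz"): 1001}


def resolve_tenant_id(attributes, catalog=CATALOG):
    """Resolve tenant_id from the tenancy labels carried on the record."""
    kind = attributes.get("tenant.kind")
    name = attributes.get("tenant.name")
    if not kind or not name:
        # Records without both tenancy labels never reach storage.
        raise RejectedRecord("record is missing tenant.kind or tenant.name")
    try:
        return catalog[(kind, name)]
    except KeyError:
        raise RejectedRecord(f"unknown tenant {kind}/{name}") from None
```

In this sketch the gateway never invents tenancy; it only checks what the producing service stamped and looks it up.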
| `resource_type` and validate that emitted resource attributes are a | ||
| subset of the declared label vocabulary. Reject undeclared labels. | ||
| 4. Derive `tenant_id` from `cloud.account.id`. | ||
| 5. Write to ClickHouse via the `clickhouse` exporter. |
We should put NATS in between the gateway and ClickHouse so we can backpressure to NATS if ClickHouse is down. This also gives us the benefit of being able to use NATS for real-time log streaming for live tailing.
Agreed. Added a NATS JetStream subject between the gateway and ClickHouse, with a separate writer consumer draining into platform_logs. Two benefits called out in the doc: backpressure when ClickHouse is unhealthy (the gateway no longer has to drop), and the same stream feeding the Loki /tail handler so live tail doesn't have to poll ClickHouse. Removed the prior "ClickHouse polling vs. Kafka" open question; NATS is the answer.
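The backpressure behaviour can be illustrated with an in-memory stand-in for the stream. This is not a NATS client and does not use any JetStream API; `BufferedStream`, `publish`, and `drain` are hypothetical names, and a failing sink is assumed to raise `ConnectionError`. The point is only that a record is acknowledged after a successful write, so an unhealthy sink grows the backlog instead of forcing drops.

```python
from collections import deque


class BufferedStream:
    """In-memory stand-in for a durable stream; not a NATS client."""

    def __init__(self):
        self._pending = deque()

    def publish(self, record):
        """Gateway side: append the record and return immediately."""
        self._pending.append(record)

    def drain(self, write):
        """Writer side: deliver records in order, stopping at the first failure.

        A record is removed (acknowledged) only after write() succeeds, so an
        unavailable sink leaves the backlog intact for a later retry.
        """
        delivered = 0
        while self._pending:
            try:
                write(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def backlog(self):
        return len(self._pending)
```

A live-tail handler would be a second, independent reader of the same stream; that part is left out of the sketch.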
| 3. Look up the declared `MonitoredResourceType` for the entry's | ||
| `resource_type` and validate that emitted resource attributes are a | ||
| subset of the declared label vocabulary. Reject undeclared labels. | ||
| 4. Derive `tenant_id` from `cloud.account.id`. |
We should leave this to be the tenant kind / name information.
Done, the gateway now resolves tenant_id from (tenant.kind, tenant.name) via the project catalog instead of from cloud.account.id. cloud.account.id stamping is removed from the design.
| Services are responsible for stamping the instance-identifying labels | ||
| (e.g. `resource.name`, `resource.namespace`, `hostname`). The gateway | ||
| enforces the vocabulary; it does not inject instance identity. |
This should include resource.kind and resource.group.
Added. The MonitoredResourceType wording for HTTPProxy now includes resource.group and resource.kind alongside resource.name / resource.namespace, and the catalog example is updated to match. Services are expected to stamp all four.
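As a rough illustration of the vocabulary check, the sketch below treats the declared labels for a resource type as a set and reports any emitted label outside it. The `DECLARED_LABELS` mapping and the `undeclared_labels` helper are hypothetical; only the four label names and the resource type come from this thread.

```python
# Hypothetical catalog entry: the label vocabulary declared for a resource type.
DECLARED_LABELS = {
    "networking.datumapis.com/HTTPProxy": {
        "resource.group",
        "resource.kind",
        "resource.name",
        "resource.namespace",
    },
}


def undeclared_labels(resource_type, resource_attributes, declared=DECLARED_LABELS):
    """Return the emitted label names that the resource type does not declare."""
    if resource_type not in declared:
        raise KeyError(f"no MonitoredResourceType declared for {resource_type}")
    # Emitted attributes must be a subset of the declared vocabulary.
    return sorted(set(resource_attributes) - declared[resource_type])
```

A gateway following the design would reject the record whenever this list is non-empty.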
| ```sql | ||
| CREATE TABLE platform_logs ( | ||
| tenant_id UInt32, | ||
| timestamp UInt64, | ||
| observed_timestamp UInt64, | ||
| severity_number UInt8, | ||
| severity_text LowCardinality(String), | ||
| body String, | ||
| log_id LowCardinality(String), | ||
| resource_type LowCardinality(String), | ||
| attributes_string Map(String, String), | ||
| resources_string Map(String, String), | ||
| trace_id String, | ||
| span_id String | ||
| ) | ||
| ENGINE = MergeTree() | ||
| PARTITION BY (tenant_id, toYYYYMM(toDateTime(timestamp / 1e9))) | ||
| ORDER BY (tenant_id, log_id, timestamp) | ||
| TTL toDateTime(timestamp / 1e9) + INTERVAL 7 DAY DELETE; | ||
| ``` |
We'll want fields for the resource identity information so we can filter appropriately (group, kind, name, namespace). Will also want to add filters where appropriate. The most common querying format will probably be querying for a specific resource's logs followed by querying tenant level information.
e.g. "give me all access logs for proxy XYZ" and "give me all logs for project X".
Promoted resource_group, resource_kind, resource_name, resource_namespace to top-level columns alongside log_id / resource_type, and reworked the sort key to (tenant_id, resource_type, resource_name, log_id, timestamp) so both "logs for proxy XYZ" and "logs for project X" hit the primary index. Also added a consumer_name column populated only on producer-destination rows, so the service-side query "give me logs for consumer X" is a top-level column filter rather than an attribute-map lookup. Called out the two query shapes explicitly under the schema.
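To see why one sort key can serve both query shapes, the sketch below models rows as tuples ordered like the reworked key and answers each query with a prefix scan over the sorted list. It is a plain-Python analogy, not ClickHouse behaviour in detail; the row values and the `prefix_scan` helper are made up for illustration.

```python
from bisect import bisect_left

# Rows keyed like the reworked sort key:
# (tenant_id, resource_type, resource_name, log_id, timestamp). Values are made up.
ROWS = sorted([
    (1, "HTTPProxy", "api-gateway", "httpproxy-access", 100),
    (1, "HTTPProxy", "api-gateway", "httpproxy-waf", 105),
    (1, "HTTPProxy", "checkout", "httpproxy-access", 101),
    (2, "HTTPProxy", "api-gateway", "httpproxy-access", 102),
])


def prefix_scan(rows, prefix):
    """Return the contiguous run of sorted rows whose key starts with prefix."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching row.
    start = bisect_left(rows, prefix)
    end = start
    while end < len(rows) and rows[end][:len(prefix)] == prefix:
        end += 1
    return rows[start:end]
```

`prefix_scan(ROWS, (1, "HTTPProxy", "api-gateway"))` corresponds to "logs for proxy XYZ" and `prefix_scan(ROWS, (1,))` to "logs for project X"; both read one contiguous range because they are prefixes of the same key.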
| ### Ingestion Quota | ||
| A new `telemetry.miloapis.com/LogIngestionQuota` resource integrates with | ||
| the standard Milo quota system. Quota is dimensioned by | ||
| `(project, category_group)` in bytes/second. On exceed: | ||
| - Gateway returns 429 with `Retry-After`. | ||
| - A per-tenant `telemetry_ingestion_dropped_bytes_total` counter is | ||
| exposed via the same Loki API so customers can see drops. | ||
| - No silent drops. |
We can remove quota for now. That'll be a follow-on enhancement.
Removed the LogIngestionQuota resource and the whole quota section. Listed it as a follow-on in Non-Goals and dropped it from the v1 delivery slice.
| ### Retention | ||
| Fixed tiered defaults; no free-form per-project retention in v1. | ||
| | Category Group | Default Retention | Disable-able | | ||
| |---|---|---| | ||
| | `allLogs` | 7 days | Yes (opt-in collection) | | ||
| | `audit` | 400 days | No (compliance signal) | | ||
| Paid retention overrides are applied per category group on a project, not | ||
| per log ID. Implemented as a TTL adjustment column populated by the | ||
| gateway at write time so existing rows are not rewritten when overrides | ||
| change. |
Retention won't be user-controllable.
Adjusted. Retention is now a flat platform-set default (7d for allLogs), implemented as the table TTL on timestamp. Removed the per-tenant TTL column and the paid-override mechanism; flagged per-project / per-category overrides as a follow-on enhancement.
| - `LogDefinition.spec.categoryGroups` provides a secondary access | ||
| dimension: `audit` requires a distinct permission from `allLogs`, | ||
| matching GCP's `roles/logging.viewAccessor` pattern scoped to a log | ||
| view. The query layer filters out log IDs the caller cannot access | ||
| before executing the SQL. |
Won't need this since this system won't handle audit logs.
Removed. The Access Control section is now just "standard Kubernetes RBAC on the project's telemetry endpoint", no audit vs. allLogs permission split, no separate access model, since the URL is the project control-plane endpoint.
| GET /projects/{project}/telemetry/loki/api/v1/query | ||
| GET /projects/{project}/telemetry/loki/api/v1/query_range | ||
| GET /projects/{project}/telemetry/loki/api/v1/labels | ||
| GET /projects/{project}/telemetry/loki/api/v1/label/{name}/values | ||
| GET /projects/{project}/telemetry/loki/api/v1/series | ||
| GET /projects/{project}/telemetry/loki/api/v1/tail |
Maybe we should change this to GET {project-control-plane-endpoint}/telemetry/*? The real URL used by milo for projects is pretty long compared to what this is showing.
Agreed, the prior `/projects/{project}/telemetry/...` shape didn't match what Milo actually issues. Switched to `{project-control-plane-endpoint}/telemetry/loki/api/v1/...`, the same per-project control-plane URL Milo gives out for Kubernetes API access, with the telemetry handler mounted at `/telemetry/...`. No `{project}` placeholder in the path and no `X-Scope-OrgID` header; the project is resolved from the endpoint itself. Grafana base URL is just the endpoint + `/telemetry/`.
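A small sketch of how a client might build request URLs under this scheme, assuming the caller already holds the project control-plane endpoint as a string. The `loki_url` helper is hypothetical; the `/telemetry/loki/api/v1/` mount path is taken from the reply above.

```python
from urllib.parse import urlencode


def loki_url(endpoint, path, **params):
    """Join a project control-plane endpoint with a Loki API path under /telemetry/."""
    url = f"{endpoint.rstrip('/')}/telemetry/loki/api/v1/{path.lstrip('/')}"
    # No {project} path segment and no X-Scope-OrgID header:
    # the endpoint itself identifies the project.
    return f"{url}?{urlencode(params)}" if params else url
```

For a placeholder endpoint such as `https://example.invalid/cp`, `loki_url(endpoint, "labels")` yields `https://example.invalid/cp/telemetry/loki/api/v1/labels`.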
| destinations: | ||
| - audience: tenant | ||
| - audience: platform |
I don't see anything that covers how this is used. I don't think this is a blocker to figure out right now since we can have all logs flowing through this pipeline be consumer facing.
Coming back to this — I think the right direction here is to follow GCP's consumer/producer pattern. Since Milo projects exist for both tenants and service producers, the routing falls out naturally from the project hierarchy. The proposal is to rename audience to type and use consumer/producer to match GCP's terminology:
- logID: networking.datumapis.com/httpproxy-access
  monitoredResourceType: networking.datumapis.com/HTTPProxy
  destinations:
    - type: consumer # written to the customer's project
    - type: producer # written to the networking service's project
A log entry gets written once per destination. The customer's copy lands in their project; the service team's copy lands in the networking service's project, with the originating customer preserved as a consumer_name label for filtering.
In practice this means a customer like ecommerce-co can query their own project and see only their data:
{log_id="networking.datumapis.com/httpproxy-access", resource_name="api-gateway"}
| json | http_response_status_code >= 500
While the Datum networking team can query the networking service's project and see across all customers — or drill into a specific one when investigating a support ticket:
# All customers with elevated error rates
sum by (consumer_name) (
rate({log_id="networking.datumapis.com/httpproxy-access"}
| json | http_response_status_code >= 500 [5m])
)
# Drill into a specific customer
{log_id="networking.datumapis.com/httpproxy-access", consumer_name="ecommerce-co"}
| json | http_response_status_code >= 500
No special cross-tenant access grants needed — the networking team just needs IAM on their own service project.
It also opens up producer-only log types that customers never see:
- logID: networking.datumapis.com/waf-engine-internal
  destinations:
    - type: producer # internal only, never visible to customers
Adopted this end-to-end. Renamed audience → type on destinations[], and the values are now consumer / producer matching GCP. The gateway emits one record per declared destination: consumer lands in the originating tenant's project, producer lands in the service's producer project with consumer_name set to the originator. consumer_name is a top-level column on platform_logs so the cross-consumer queries you sketched work without map lookups. Added example LogQL for both the consumer-side ("logs for my proxy") and producer-side ("error rate by consumer" + "drill into consumer X") shapes. Also called out producer-only logs (no consumer destination) as a supported case for internal diagnostics that should never be visible to customers.
No cross-tenant grants needed on either side — each principal has IAM on whichever project's endpoint they're querying.
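The per-destination fan-out can be sketched as follows, assuming each declared destination is a mapping with a `type` of `consumer` or `producer` as in the YAML above. The `fan_out` function, the `project` field, and the argument names are hypothetical; only `consumer_name` and the destination types come from the thread.

```python
def fan_out(record, destinations, consumer_project, producer_project):
    """Emit one copy of record per declared destination type."""
    copies = []
    for dest in destinations:
        if dest["type"] == "consumer":
            # The customer's copy lands in the originating tenant's project.
            copies.append({**record, "project": consumer_project})
        elif dest["type"] == "producer":
            # The service team's copy keeps the originator as consumer_name.
            copies.append({
                **record,
                "project": producer_project,
                "consumer_name": consumer_project,
            })
        else:
            raise ValueError(f"unknown destination type: {dest['type']}")
    return copies
```

A producer-only definition simply yields a single copy, so nothing is ever written to the customer's project for that log ID.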
Adjust the customer-facing logs design based on PR #72 review.
Key changes:
- Tenancy travels on the log record as `tenant.kind` / `tenant.name` instead of being stamped by the gateway from workload identity, since log producers are typically service components (e.g. Envoy) writing to a sink rather than consumer-authored applications
- Replace the `audience: tenant|platform` destination model with `type: consumer|producer` matching GCP. The gateway emits one record per destination; producer rows carry `consumer_name` so service teams can query their own producer project across all consumers without cross-tenant grants
- Add NATS JetStream between the gateway and ClickHouse for backpressure and to back the Loki `/tail` handler without polling ClickHouse
- Promote `resource_group`, `resource_kind`, `resource_name`, `resource_namespace`, and `consumer_name` to top-level columns and reorder the sort key to serve per-resource and per-tenant queries
- Expose the Loki API under the project control-plane endpoint (`{project-control-plane-endpoint}/telemetry/...`) instead of a `/projects/{project}/...` path
- Remove audit logs from scope; they're handled by `milo-os/activity`
- Drop `LogIngestionQuota` and user-controllable retention from v1; both move to follow-on enhancements
Regenerate ingestion-pipeline diagram to match.
Extend the AI Edge access and WAF log entry schemas to support a per-request lifecycle view (firewall decision, geo/routing) keyed off a shared correlation ID.
Key changes:
- Add http.request.id (Envoy x-request-id) to both httpproxy-access and httpproxy-waf so a single request_id filter returns the access entry and every WAF rule that fired on the request
- Denormalise waf.outcome and waf.matched_rules onto the access log so "show me blocked requests" is a single stream filter, not a join; per-rule detail stays on the WAF log for drill-down
- Add user_agent.original to the access log
- Add edge.pop.ingress (and edge.pop.upstream on the access log) to carry the ingress / upstream PoP per request. PoP is emission context, not resource identity, so it lives on the entry schema not the MonitoredResourceType
- Add a Request Correlation subsection explaining the join key and the denormalised-summary pattern
Remove edge.pop.ingress and client.address from the WAF entry schema. Both are already on the paired access log and reachable by joining on http.request.id; duplicating them on every matched-rule row is waste, and the WAF has no concept of edge.pop.upstream anyway since it runs before the routing decision.
Clarify in the Request Correlation subsection that the WAF schema is deliberately lean: any context that exists on the access log is reached via the join rather than copied per matched rule.
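The join on `http.request.id` described in these commits can be sketched client-side as below, assuming entries are available as dictionaries keyed by the attribute names above. The `correlate` helper and the sample shape are hypothetical; in practice the same correlation would more likely be expressed as a query filter on the request ID.

```python
from collections import defaultdict


def correlate(access_entries, waf_entries):
    """Attach every WAF entry to the access entry sharing its http.request.id."""
    waf_by_request = defaultdict(list)
    for entry in waf_entries:
        waf_by_request[entry["http.request.id"]].append(entry)
    # The access entry carries the request context (PoP, client address);
    # the lean WAF entries contribute only per-rule detail.
    return [
        {"access": access, "waf": waf_by_request.get(access["http.request.id"], [])}
        for access in access_entries
    ]
```

Each result pairs one access entry with zero or more matched-rule entries, which is why context such as `edge.pop.ingress` only needs to live on the access log.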
@scotwells thanks for your feedback, have left notes on your comments and updated the doc. I've also added some bits in about using the
Summary
Proposes a v1 customer-facing logs pipeline for the Datum platform, scoped to AI Edge (HTTPProxy access logs and WAF events) as the first producer. Doc lives at `docs/architecture/customer-facing-logs.md`.
What's in the design
- `ServiceConfiguration.spec.logs[]` fans out to a new `telemetry.LogDefinition` CRD; `spec.monitoredResourceTypes[]` additionally fans out to a new `telemetry.MonitoredResourceType` CRD alongside the existing billing fan-out.
- OTel Collector gateway stamps `cloud.account.id`, validates resource attributes against the declared label vocabulary, derives `tenant_id`, and writes to ClickHouse.
- Shared ClickHouse `platform_logs` table, `tenant_id` first in `ORDER BY` and partition key, `log_id` and `resource_type` promoted to top-level columns.
- Loki-compatible HTTP API exposed under `/projects/{project}/telemetry/loki/api/v1/...`. URL is authoritative for tenancy; `X-Scope-OrgID` ignored. Labels/series served from the `MonitoredResourceType` catalog so Grafana works on empty projects.
- Tiered retention defaults: 7d for `allLogs`, 400d for `audit`; per-category overrides applied via a TTL column at write time.
- `LogIngestionQuota` resource on `(project, category_group)`; 429 + drop counter on exceed, no silent drops.
- `allLogs` opt-in via `LogCollectionPolicy`; `audit` always-on.
- Attribute-level redaction via `LogRedactionPolicy`. Body is not redacted.
Scope
v1 producer is AI Edge only: two `LogDefinition`s on `networking.datumapis.com/HTTPProxy`, an access log (`allLogs`) and WAF events (`allLogs` + `audit`). Control-plane audit logs (covered by `milo-os/activity`) integrate later via the shared catalog. `LogSource` in `ExportPolicy` is deferred.
Open questions
Captured explicitly at the end of the doc: live tail backend (poll vs. Kafka), `LogCollectionPolicy` granularity, the LogQL subset to support in v1, and how catalog-backed label discovery handles tenant-specific label values.