This repository has been archived by the owner on Jan 12, 2022. It is now read-only.

Move alertmanager and credential/exporter config APIs into common /v1/ path under the root cluster domain #379

Closed
nickbp opened this issue Feb 17, 2021 · 15 comments

nickbp (Contributor) commented Feb 17, 2021

DEADLINE: Try to have this figured out by the end of the week (Feb 19)

There are currently multiple endpoints for setting configuration for tenants:

  • config.cluster.example/api/v1/credentials/*
  • config.cluster.example/api/v1/exporters/*
  • cortex.cluster.example/ruler/*
  • cortex.cluster.example/api/v1/rules/*
  • cortex.cluster.example/api/v1/alerts/*
  • cortex.cluster.example/alertmanager/*
  • cortex.cluster.example/multitenant_alertmanager/*

This current structure has a few issues:

  • The config subdomain used for credential/exporter configuration can conflict with a tenant named config, and it's confusing because the subdomains are otherwise meant to identify tenants.
  • For alertmanager configuration, we'd ideally like configuration endpoints to be separate from data ingest endpoints.

So we'd ideally like to only use subdomains for per-tenant data ingest, and then have these config endpoints at the root domain (shared with the UI). Requests to the config endpoints would identify the tenant name using the Bearer token, as is currently the case.

The changes will be to move the endpoints to the root domain, under a reserved /api/* path like the following:

  • Move config.cluster.example/api/v1/credentials/* to cluster.example/api/v1/credentials/*
  • Move config.cluster.example/api/v1/exporters/* to cluster.example/api/v1/exporters/*
  • cortex.cluster.example/ruler/* to cluster.example/api/v1/ruler/*
  • cortex.cluster.example/api/v1/rules/* to cluster.example/api/v1/rules/*
  • cortex.cluster.example/api/v1/alerts/* to cluster.example/api/v1/alerts/*
  • cortex.cluster.example/alertmanager/* to cluster.example/api/v1/alertmanager/* (may go to cluster.example/alertmanager/* in the future)
  • cortex.cluster.example/multitenant_alertmanager/* to cluster.example/api/v1/multitenant_alertmanager/* (may go to cluster.example/multitenant_alertmanager/* in the future)
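
For illustration only, a minimal Go sketch of a single service handling the reserved /api/v1/* path; the upstream hostnames, port, and stub handlers are assumptions for the sketch, not the actual Opstrace service names or code:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo returns a handler that forwards requests to the given upstream.
// The upstream hostnames used below are placeholders for illustration only.
func proxyTo(upstream string) http.Handler {
	u, err := url.Parse(upstream)
	if err != nil {
		log.Fatalf("bad upstream %q: %v", upstream, err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	mux := http.NewServeMux()

	// Config APIs served directly by this service (stubbed out here).
	mux.HandleFunc("/api/v1/credentials/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("credentials API\n"))
	})
	mux.HandleFunc("/api/v1/exporters/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("exporters API\n"))
	})

	// Cortex ruler/alertmanager APIs proxied through to the Cortex services.
	mux.Handle("/api/v1/ruler/", proxyTo("http://ruler.cortex.svc.cluster.local"))
	mux.Handle("/api/v1/rules/", proxyTo("http://ruler.cortex.svc.cluster.local"))
	mux.Handle("/api/v1/alerts/", proxyTo("http://alertmanager.cortex.svc.cluster.local"))
	mux.Handle("/api/v1/alertmanager/", proxyTo("http://alertmanager.cortex.svc.cluster.local"))
	mux.Handle("/api/v1/multitenant_alertmanager/", proxyTo("http://alertmanager.cortex.svc.cluster.local"))

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```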

The changes would be applied in a few places:

  • Move the alertmanager config endpoint handling into the API config executable (see /go/cmd/config/)
  • Update the config.cluster.example/* K8s Ingress object to instead accept data sent to cluster.example/api/v1/* (see the sketch below). Any requests sent to this path will then be routed to the API config pod, not the UI.
  • Rename config-api to opstrace-api to align with opstrace-application used for the UI
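
As a rough sketch of that Ingress change (the controller itself is not written this way; the service name, namespace, and port here are assumptions), the rule routing cluster.example/api/* to the API pod rendered with the Kubernetes Go types might look roughly like:

```go
package main

import (
	"encoding/json"
	"fmt"

	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pathType := netv1.PathTypePrefix
	// Service name, namespace, host, and port are assumptions for illustration.
	ing := netv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: "opstrace-api", Namespace: "application"},
		Spec: netv1.IngressSpec{
			Rules: []netv1.IngressRule{{
				Host: "cluster.example",
				IngressRuleValue: netv1.IngressRuleValue{
					HTTP: &netv1.HTTPIngressRuleValue{
						Paths: []netv1.HTTPIngressPath{{
							// Everything under /api/ goes to the API pod, not the UI.
							Path:     "/api/",
							PathType: &pathType,
							Backend: netv1.IngressBackend{
								Service: &netv1.IngressServiceBackend{
									Name: "opstrace-api",
									Port: netv1.ServiceBackendPort{Number: 8080},
								},
							},
						}},
					},
				},
			}},
		},
	}
	out, _ := json.MarshalIndent(ing, "", "  ")
	fmt.Println(string(out))
}
```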

See also: Current credentials/exporters APIs listed in a README here: https://github.com/opstrace/opstrace/blob/main/go/cmd/config/README.md

spahl (Contributor) commented Feb 17, 2021

I added the rules and alert related endpoints to the list.

MatApple (Contributor) commented:

@spahl what is the difference between alertmanager and multitenant_alertmanager?

MatApple (Contributor) commented Feb 17, 2021

Also, let's put all these APIs under a common root path. /api/ seems reasonable to me. Perhaps we need one other, /components/, for things like Alertmanager, so we'd have /components/alertmanager/* to access the Alertmanager endpoints (UI & APIs).

Remember the UI is served at the root too, so let's have a very small number of namespaces (represented by the root of the path) for which we don't render the UI and instead proxy to an API or a system component.

MatApple (Contributor) commented Feb 17, 2021

OK, after investigating a bit more, I don't think we need to expose the /ring and /status endpoints of Cortex's multitenant_alertmanager API (https://cortexmetrics.io/docs/api/#alertmanager-status). We can figure out how to build this state into the UI.

In general, I'm of the mindset that we shouldn't expose everything that our internal components expose, unless it's absolutely necessary, because it's another surface that we have to maintain for our users.

The other thing to keep in mind is that anything we serve under the root path cannot be authenticated by the Ingress, because the UI needs to be accessed without authentication (login). Authentication therefore has to be done in the application that is serving the endpoint. The UI does its own authentication, and @nickbp's config APIs do their own authentication (but still require someone to set the X-Tenant header). The alerts and rules endpoints require an X-Scope-OrgID header (which is equivalent to X-Tenant). It's starting to feel a bit bizarre to me to have all these APIs served at cluster.example/ while having to send an X-Tenant or X-Scope-OrgID header for some of them to specify the tenant.

Another important point is that the /v1/alerts and /v1/rules endpoints do not perform authentication themselves, but require authenticated access. This means we'd have to build an authentication proxy that sits behind the Ingress (we can't have auth on the Ingress, as described above).
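
A rough Go sketch of what such an authentication proxy could look like; the token validation is stubbed out, and the upstream URL and tenant lookup are assumptions, not the real implementation:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// tenantFromToken would validate the Bearer token and return the tenant it
// was issued for. Stubbed here; a real service would verify a signed token.
func tenantFromToken(token string) (string, bool) {
	if token == "" {
		return "", false
	}
	return "tenant-a", true // placeholder
}

func withTenantAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		tenant, ok := tenantFromToken(token)
		if !ok {
			http.Error(w, "missing or invalid token", http.StatusUnauthorized)
			return
		}
		// Cortex expects the tenant in X-Scope-OrgID; set it server-side so
		// clients never pick their own tenant.
		r.Header.Set("X-Scope-OrgID", tenant)
		next.ServeHTTP(w, r)
	})
}

func main() {
	cortex, _ := url.Parse("http://ruler.cortex.svc.cluster.local") // assumed upstream
	proxy := httputil.NewSingleHostReverseProxy(cortex)
	log.Fatal(http.ListenAndServe(":8080", withTenantAuth(proxy)))
}
```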

At this point, I'd rather serve all these tenant-specific APIs under the tenant's domain, i.e. prod.cluster.example.com/apis/, and utilize the authentication that already exists on the Ingress. Now the problem is: how do we use an Ingress resource in the tenant's namespace to connect to a service in the Cortex namespace? It should be possible to expose the Cortex service inside tenant-a by creating an ExternalName type Service in the tenant namespace that points into the Cortex namespace:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: cortex-api
  namespace: tenant-a
spec:
  type: ExternalName
  externalName: cortex-api.cortex.svc.cluster.local
  ports:
  - port: 80
```

The Ingress resource in the tenant-a namespace can then reference cortex-api as its upstream service.

MatApple (Contributor) commented Feb 17, 2021

Extending my suggestion above, our endpoints would then use a combination of DNS-based and path-based routing:

These two endpoints proxy to the Cortex service in the cortex namespace:

  • tenant-name.cluster.example.com/api/v1/rules/*
  • tenant-name.cluster.example.com/api/v1/alerts/*

Credentials and exporters can proxy to the config service in the application namespace:

  • tenant-name.cluster.example.com/api/v1/credentials/*
  • tenant-name.cluster.example.com/api/v1/exporters/*

The Alertmanager UI is served the same way we currently serve Grafana in each tenant (we just have to create an ExternalName Service pointing to the Alertmanager UI service in the cortex namespace). Other than that, the Alertmanager UI is already configured to be accessed at:

  • alertmanager.tenant-name.cluster.example.com

MatApple (Contributor) commented Feb 17, 2021

Three pros for the above suggestion:

  1. Exposing the endpoints on each tenant's domain means the client doesn't have to specify a tenant header with each request (since we inject it at the Ingress).
  2. The UI is served at cluster.example.com/, so if we decide to serve APIs under this same path, we'll have to be thoughtful about route clashes.
  3. We already have tenant-specific APIs exposed at <api>.<tenant>.cluster.example.com, so moving to serving these new APIs under cluster.example.com/ would change our current philosophy. If we were to break away from our current thinking around tenant-specific APIs, then we should be consistent (which probably means migrating some existing APIs).

MatApple (Contributor) commented:

Curious to hear @jgehrcke's and @sreis's thoughts on this matter.

nickbp (Contributor, Author) commented Feb 18, 2021

FWIW, the current X-Tenant header mentioned in the README is only used in testing, when -disable-api-authn is specified when starting the service. By default the service ignores X-Tenant and instead checks the Authorization header/Bearer token. The X-Tenant option is just an "escape hatch" for specifying the tenant in testing situations where generating a valid signed Authorization header would be difficult.

(And the X-Tenant name itself could be changed; I just added it for my own use in testing, so it's not "locked in" anywhere yet.)

MatApple (Contributor) commented:

@nickbp since the UI doesn't use a Bearer token for auth (it uses a session cookie), it will not be able to access these APIs. This is one more compelling reason for the UI to interact with the query/exporters APIs through Hasura (although there is likely still the challenge of injecting a Bearer token from Hasura -> the credentials/exporters API).

Maybe the authenticator.go middleware also needs to check for a valid session if no Bearer token is provided?

nickbp (Contributor, Author) commented Feb 18, 2021

That does sound like it would be tricky. I assume any session info provided by the UI (e.g. via an HTTP header) isn't going to specify a single tenant, since a UI user may be managing multiple tenants. It does sound like the ideal scenario would be to treat this Go API as something that end users would be calling, while the UI would talk to GraphQL directly, and ideally GraphQL would implement sanity checks via Hasura actions as you recommended.

I'll add notes about this to the Hasura actions issue here: #373

In the context of this immediate issue about moving the paths of the current API, it shouldn't be a blocker, but it does sound like we'll need a plan for how any such validation would work sooner rather than later so that it isn't blocking the UI.

triclambert (Contributor) commented:

Just a couple of misc thoughts having read through all of this...

> /api/ seems reasonable to me.

+1, should we go down this path. In some other cases, we may be able to simplify the root to something shorter, such as a single letter.

> we shouldn't expose everything that our internal components expose, unless it's absolutely necessary

+1

> the alertmanager UI is already configured to be accessed here: alertmanager.tenant-name.cluster.example.com

As for maintaining the subdomains: if we can serve the UI with path-based routing to match /grafana, that seems preferable to me.

nickbp pushed a commit that referenced this issue Feb 22, 2021
… in GraphQL, the controller will automatically apply that change to Kubernetes. Credentials are converted to Kubernetes `Secret`s, while exporters are converted to Kubernetes `Deployment`s (which may link to the `Secret`s referenced by the exporter config).

The synchronization itself works against the existing Kubernetes reconciliation loop in the Controller. The main addition is that the reconciliation state now includes entries originating from GraphQL, but the reconciliation flow itself remains the same structurally and ends up being pretty straightforward. This structure would potentially be extendable to other state retrieved from GraphQL in the future.

Summary:
- Add GraphQL queries for dumping a "snapshot" of credentials and exporters when starting the sync, in addition to the existing subscription queries.
- Add subscription client to controller, largely copied from app.
- Implement `informers` for fetching and subscribing to the GraphQL credentials and exporters, and applying this data to the Redux state object.
- Implement `resources` for converting the Redux state to Kubernetes objects, with type-specific logic for how to deploy each credential and exporter type.

I've so far tested this by hand by creating credentials/exporters via `curl` against the Go GraphQL `config` service (which is due to get moved/renamed), and was able to verify that actual metrics successfully appeared in Prometheus/Cortex from each of the current two types of exporters. As such this should complete issue #312 and epic #310. There remains some doc work via #358, which in turn depends on the GraphQL config API naming that's being updated via #379.

Signed-off-by: Nick Parker <nick@opstrace.com>
nickbp pushed a commit that referenced this issue Feb 22, 2021
nickbp pushed a commit that referenced this issue Feb 23, 2021
MatApple pushed a commit that referenced this issue Feb 23, 2021
nickbp self-assigned this Feb 23, 2021
nickbp added the "in progress state (used by codetree)" label Feb 23, 2021
nickbp (Contributor, Author) commented Feb 24, 2021

Had a quick check-in on Slack; it sounds like it's fine to just put the two APIs in question under cluster.example/api/v1/alertmanager/* and cluster.example/api/v1/multitenant_alertmanager/* for now, so the "reserved space" in the K8s Ingress would be everything under /api/*.

It sounds like /alertmanager is a UI, so we may yet need to revisit that specific case later if the subpath gets to be too weird.

nickbp (Contributor, Author) commented Feb 24, 2021

Note from @MatApple: Will need to add `new RegExp("^/api/")` to the list here: https://github.com/opstrace/opstrace/blob/main/packages/app/config/swTemplate.js#L46

nickbp (Contributor, Author) commented Feb 25, 2021

Something I realized this morning with the /api/v1/[alertmanager,multitenant_alertmanager] redirects is that if these return HTML pages, any absolute URL paths in responses may need to be updated. I'll check real quick whether that's the case.

The Go reverse proxy code supports intercepting the response, so it should be possible to rewrite HTML content if necessary. But if that ends up being needed, IMO it may push things in favor of just having those endpoints accessible at their original root paths and thereby avoiding the need for rewrites, at the cost of adding special cases for /alertmanager and /multitenant_alertmanager in addition to the planned /api exception.
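
For reference, the response interception mentioned here is the `ModifyResponse` hook on `httputil.ReverseProxy`. A minimal sketch (the upstream URL and the rewritten prefix are assumptions, and gzip-encoded bodies are not handled):

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strconv"
	"strings"
)

func main() {
	upstream, _ := url.Parse("http://alertmanager.cortex.svc.cluster.local") // assumed
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	// Rewrite absolute links in HTML responses so the UI works when served
	// under /api/v1/alertmanager/ instead of /alertmanager/.
	proxy.ModifyResponse = func(resp *http.Response) error {
		if !strings.Contains(resp.Header.Get("Content-Type"), "text/html") {
			return nil
		}
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		resp.Body.Close()
		// Gzip-encoded responses would need decompression first; omitted here.
		rewritten := bytes.ReplaceAll(body,
			[]byte(`href="/alertmanager/`),
			[]byte(`href="/api/v1/alertmanager/`))
		resp.Body = io.NopCloser(bytes.NewReader(rewritten))
		resp.Header.Set("Content-Length", strconv.Itoa(len(rewritten)))
		resp.ContentLength = int64(len(rewritten))
		return nil
	}
	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```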

spahl (Contributor) commented Feb 25, 2021 via email

nickbp pushed a commit that referenced this issue Feb 25, 2021
* controller: Update routes for config APIs

- Updates application ingress to route `<cluster>/api/*` to the config-api pod (was: `config.<cluster>/api/*`)
- Moves cortex alertmanager and ruler APIs into the config-api pod
- Updates the config-api pod with alertmanager and ruler endpoints to be proxied
- Refactors auth and proxy handling a bit for reuse across services

Signed-off-by: Nick Parker <nick@opstrace.com>

* config: Route /api/v1/[ruler,alertmanager,multitenant_alertmanager] -> /*

The default logic in `NewSingleHostReverseProxy` just supports appending paths (e.g. `/foo` + `/bar` = `/foo/bar`), but in this case we want to overwrite the source path rather than appending it to the dest path. So make/use a custom director that supports custom rewrites.

Signed-off-by: Nick Parker <nick@opstrace.com>

* config: Fix lint, simplify redirect handling

Signed-off-by: Nick Parker <nick@opstrace.com>

* controller: Assign alertmanager UI http prefix, fix config path rewrite

Signed-off-by: Nick Parker <nick@opstrace.com>
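
The custom director described in the second commit message above could be sketched roughly like this (the prefix and upstream are assumptions; this is not the actual implementation):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// newPrefixRewriteProxy proxies to upstream, replacing the given source
// prefix with "/" instead of appending paths the way
// NewSingleHostReverseProxy's default director does.
func newPrefixRewriteProxy(prefix string, upstream *url.URL) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			req.URL.Scheme = upstream.Scheme
			req.URL.Host = upstream.Host
			req.Host = upstream.Host
			// e.g. /api/v1/alertmanager/foo -> /foo
			trimmed := strings.TrimPrefix(req.URL.Path, prefix)
			req.URL.Path = "/" + strings.TrimPrefix(trimmed, "/")
		},
	}
}

func main() {
	upstream, _ := url.Parse("http://alertmanager.cortex.svc.cluster.local") // assumed
	mux := http.NewServeMux()
	mux.Handle("/api/v1/alertmanager/", newPrefixRewriteProxy("/api/v1/alertmanager", upstream))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```
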
nickbp closed this as completed Feb 25, 2021
nickbp removed the "in progress state (used by codetree)" label Feb 25, 2021