Scrape external service with FQDN #3204

Closed
shay-berman opened this issue May 8, 2020 · 32 comments
@shay-berman commented May 8, 2020

What happened?
I cannot scrape a service by its FQDN when the service is outside of the k8s cluster. It only works if you set the service IP, but I prefer not to use IP(s), which may change.

See the Prometheus UI, which shows just the ServiceMonitor job but nothing inside the endpoints list:
[screenshot: Prometheus targets page showing the ServiceMonitor job with an empty endpoints list]

Here is the YAML that defines the Prometheus CR + ServiceMonitor + ExternalName Service with the SERVICE-FQDN. When you open Prometheus, it does not scrape the external service.

apiVersion: v1
kind: Service
metadata:
  name: rs1
  labels:
    app: prometheus1 # ServiceMonitor match this label

spec:
  externalName: <SERVICE-FQDN>
  type: ExternalName
  ports:
  - name: https
    protocol: TCP
    port: 8070
    targetPort: 8070

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus1
  labels:
    app: prometheus1 # This is what the prometheus1 CR looking for match
spec:
  endpoints:
  - path: /
    scheme: https
    port: https
    tlsConfig:
      insecureSkipVerify: true 
  jobLabel: jobName
  selector:
    matchLabels:
      app: prometheus1


---

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app: prometheus-operator-prometheus1
  name: prometheus-operator-prometheus1
spec:
  replicas: 1
  serviceAccountName: chart1-prometheus-operator-prometheus
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: chart1-prometheus-operator-alertmanager
      namespace: default
      pathPrefix: /
      port: web
  image: quay.io/prometheus/prometheus
  version: v2.15.2
  logLevel: debug
  portName: web
  routePrefix: /
  retention: 10d
  serviceMonitorSelector:
    matchLabels:
      app: prometheus1
  ruleNamespaceSelector: {}
  thanos:
    baseImage: thanosio/thanos
    version: v0.12.0

The only way to scrape the SERVICE-FQDN is to also add an Endpoints object that points to the SERVICE-FQDN's specific IP(s). Only then do you see the target working in Prometheus. But the whole point is to use only the SERVICE-FQDN and not specific IPs.

Did you expect to see something different?
I would expect to have an option to scrape by SERVICE-FQDN as well, not only by IPs.
Here are some blogs that explain how to scrape an external service, but again only with Endpoints that use specific IPs:

But again, none of them uses the FQDN, and I would expect there to be such a way.

How to reproduce it (as minimally and precisely as possible):
Just use the YAML above and you will see that the service (ExternalName) is not visible as a target in Prometheus.

Environment

  • Prometheus Operator version:
    quay.io/coreos/prometheus-operator:v0.37.0


  • kubectl describe deployment chart1-prometheus-operator-operator
    Name: chart1-prometheus-operator-operator
    Namespace: default
    CreationTimestamp: Wed, 06 May 2020 21:31:34 +0300
    Labels: app=prometheus-operator-operator
    app.kubernetes.io/managed-by=Helm
    chart=prometheus-operator-8.12.12
    heritage=Helm
    release=chart1
    Annotations: deployment.kubernetes.io/revision: 2
    meta.helm.sh/release-name: chart1
    meta.helm.sh/release-namespace: default
    Selector: app=prometheus-operator-operator,release=chart1
    Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
    StrategyType: RollingUpdate
    MinReadySeconds: 0
    RollingUpdateStrategy: 25% max unavailable, 25% max surge
    Pod Template:
    Labels: app=prometheus-operator-operator
    chart=prometheus-operator-8.12.12
    heritage=Helm
    release=chart1
    Service Account: chart1-prometheus-operator-operator
    Containers:
    prometheus-operator:
    Image: quay.io/coreos/prometheus-operator:v0.37.0
    Port: 8080/TCP
    Host Port: 0/TCP
    Args:
    --manage-crds=true
    --kubelet-service=kube-system/chart1-prometheus-operator-kubelet
    --logtostderr=true
    --localhost=127.0.0.1
    --prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.37.0
    --config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
    --config-reloader-cpu=100m
    --config-reloader-memory=25Mi
    --log-level=debug
    Environment:
    Mounts:
    tls-proxy:
    Image: squareup/ghostunnel:v1.5.2
    Port: 8443/TCP
    Host Port: 0/TCP
    Args:
    server
    --listen=:8443
    --target=127.0.0.1:8080
    --key=cert/key
    --cert=cert/cert
    --disable-authentication
    Environment:
    Mounts:
    /cert from tls-proxy-secret (ro)
    Volumes:
    tls-proxy-secret:
    Type: Secret (a volume populated by a Secret)
    SecretName: chart1-prometheus-operator-admission
    Optional: false
    Conditions:
    Type Status Reason
    Available True MinimumReplicasAvailable
    Progressing True NewReplicaSetAvailable
    OldReplicaSets:
    NewReplicaSet: chart1-prometheus-operator-operator-746d86bbb7 (1/1 replicas created)
    Events:


  • Kubernetes version information:

    kubectl version
    Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

GKE

  • Manifests:
    See the manifests above.
  • Prometheus Operator Logs:
  • Unfortunately, I don't see many related log entries. It's hard to debug why it's not working.

Anything else we need to know?:

- job_name: job
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets:
      - <SERVICE-FQDN>:port

This will work, but again I want to do it the k8s way, by setting an ExternalName Service with its FQDN.

@sebarys commented May 9, 2020

I think ExternalName services are not supported; there are a few issues about it, e.g. #218.

@shay-berman (Author)

Thanks @sebarys for directing me to a similar issue, #218.

But it looks like that is a very long thread without a solution:
Issue #218 was closed and redirected to #372, which was closed and then redirected to prometheus/prometheus#2791, which was closed without any formal solution for ExternalName services in k8s.

The latest summaries of that thread are #218 (comment) and prometheus/prometheus#2791 (comment), but again there is no formal solution.

If I understood correctly, the way to scrape an external service is to use Endpoint IPs as mentioned in #834 (comment) (or in the blogs), but that does not help if you need to use the service FQDN.

Another way to solve it is to use the old way (not the k8s way) and define additional scrape configs with regular static_configs. But again, this way you don't use the k8s concept of an ExternalName Service:

  static_configs:
    - targets:
      - SERVICE-FQDN

@gouthamve \ @brancz \ @sebarys
So can you please provide details on the best practice for scraping an external service FQDN (without using Endpoint IPs, just the FQDN of the service)? I think it should be officially documented.

@shay-berman (Author)

@gouthamve \ @brancz \ @sebarys - can anyone help with this, please?

@sebarys commented May 13, 2020

In our project we've added this using static_configs, as it looks like there is no plan for now to have this feature in prometheus-operator.

@brancz (Contributor) commented May 13, 2020

You cannot in a meaningful way monitor external services as prometheus needs to scrape each instance/process individually. That’s why you need to use a separate discovery mechanism that actually does discover all processes.

@shay-berman (Author)

OK, so based on your feedback it looks like there is no plan to support scraping the ExternalName k8s Service type, and your recommendation is to use static_configs to define the external service.

Should I close this ticket?

@brancz (Contributor) commented May 18, 2020

As the issue is framed, it won't happen. That said, we have thought about making more generic scrape configs available through some new CRD in the prometheus-operator. That could be something that could be used for this. As far as I know, no one is working on this currently, though.

@elsbrock

You cannot in a meaningful way monitor external services as prometheus needs to scrape each instance/process individually. That’s why you need to use a separate discovery mechanism that actually does discover all processes.

One use case I see is federation, where I want to configure another Prometheus instance as a target to be scraped. It'd be great if that were possible by means of a ServiceMonitor.

@brancz (Contributor) commented Jun 23, 2020

@elsbrock for a Prometheus in the same cluster this is perfectly possible. The federation endpoint on Prometheus is no different from any other /metrics endpoint, so all you need to do is change the path to scrape from /metrics to /federate, which you can specify using the path field in the ServiceMonitor endpoint definition.
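
For illustration, a minimal ServiceMonitor sketch for scraping an in-cluster Prometheus federation endpoint (the names, labels, port and match[] selector below are assumptions, not values from this thread):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: federate-source-prometheus    # hypothetical name
  namespace: monitoring               # hypothetical namespace
spec:
  endpoints:
  - port: web                         # assumed name of the source Prometheus service port
    path: /federate                   # scrape the federation endpoint instead of /metrics
    params:
      'match[]':
      - '{job!=""}'                   # example selector: federate every series that has a job label
    honorLabels: true                 # keep the job/instance labels of the federated series
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # assumed label on the source Prometheus Service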

@elsbrock

Right, but in our case the Prometheus instance is running in an entirely different network segment, so we either need to use the global config (which I don't find nice from a dependency point of view; a ServiceMonitor seems much better) or set up a reverse proxy pod that I can then scrape using a ServiceMonitor.

@brancz (Contributor) commented Jun 24, 2020

Yes, for those cases an additionalScrapeConfigs entry is best.
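
For illustration, a minimal sketch of that wiring (the Secret name, key, and target FQDN are placeholders, not values from this thread): the raw scrape config lives in a Secret and the Prometheus CR points at it via additionalScrapeConfigs.

apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs     # hypothetical name
  namespace: monitoring               # hypothetical namespace
stringData:
  prometheus-additional.yaml: |
    - job_name: external-service
      scheme: https
      tls_config:
        insecure_skip_verify: true
      static_configs:
      - targets:
        - external.example.com:8070   # placeholder FQDN:port
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-operator-prometheus1
spec:
  # reference the Secret holding the extra scrape configs
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml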

@mrueg (Contributor) commented Jul 3, 2020

@brancz I wonder if there's a way to provide additionalScrapeConfigs as a Kubernetes CRD object?
Often multiple teams share a single prometheus(-operator) deployment, and that would enable self-service for scrape configs.

@brancz (Contributor) commented Jul 4, 2020

This is not possible today, but I would like to get there one day. I would like to essentially introduce a lower-level CRD, "ScrapeConfig", which all the other config-generation CRs are ultimately converted to. The difficulty is maintaining the types for such a CR; this would need to be automated by inspecting the types from Prometheus and converting them. All of this is not impossible, but it will need a non-trivial amount of work, which I currently don't have time for. If anyone from the community would like to invest time into this, though, I'd be happy to discuss possible designs and caveats that I can think of.

@mircohacker

@brancz I would be interested in looking into implementing this CRD. How should we proceed?

@brancz (Contributor) commented Jul 17, 2020

I think a design doc would be in order, as what I'm imagining would involve synchronizing types from the prometheus repo.

@angeloskaltsikis

Any news on this feature? (or the design doc?)

@jasonstitt commented Dec 9, 2020

Running into this trying to scrape AWS MSK (see: https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html). MSK provides a FQDN for each broker, and we also have them aliased to consistent in-cluster names using ExternalName services. The underlying IP addresses might be stable, but I don't see sufficient documentation to rely on that, and in any case they would have to be hardcoded per cluster.

So now the options are (a) bypass the CRD setup and use config files (aka additionalScrapeConfigs) or (b) set up reverse proxies just to scrape existing endpoints that are available to scrape.

You cannot in a meaningful way monitor external services as prometheus needs to scrape each instance/process individually. That’s why you need to use a separate discovery mechanism that actually does discover all processes.

This is an example of a case in which you can (as there is an FQDN provided per instance).

@i9 commented Jan 21, 2021

Running into this trying to scrape AWS MSK (see: https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html). MSK provides a FQDN for each broker, and we also have them aliased to consistent in-cluster names using ExternalName services. The underlying IP addresses might be stable, but I don't see sufficient documentation to rely on that, and in any case they would have to be hardcoded per cluster.

So now the options are (a) bypass the CRD setup and use config files (aka additionalScrapeConfigs) or (b) set up reverse proxies just to scrape existing endpoints that are available to scrape.

You cannot in a meaningful way monitor external services as prometheus needs to scrape each instance/process individually. That’s why you need to use a separate discovery mechanism that actually does discover all processes.

This is an example of a case in which you can (as there is an FQDN provided per instance).

@jasonstitt have you tried

static_configs:
  - targets: ['msk-alias-1.namespace:11001']

It worked for us.

@lilic (Contributor) commented Jan 21, 2021

If anyone is willing to do a design doc for this they are more than welcome to create a PR for this! 🎉

@alexisph

Hi. I came across this issue in our OpenShift clusters. Here's how I solved it:

  1. Create a Service with externalName set to the public URL
  2. Create an Endpoints object with the service IP address and the same name as the Service above
  3. Create a ServiceMonitor for this new Service that replaces the __address__ label:
...
spec:
  endpoints:
  - path: /metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
      - sourceLabels: [__address__]
        targetLabel: __address__
        regex: (.*)
        replacement: "$FQDN:$PORT"
        action: replace
...

I was then able to scrape the FQDN!

@miguel-callejas-coderoad-com

@alexisph that's great. Can you share a little more of your configuration? I'm trying to configure a Service with the ExternalName property to reach an FQDN outside Kubernetes following your recommendations, but have had no luck.

kind: "Service"
apiVersion: "v1"
metadata:
  namespace: workload
  name: nfs-centralus-001
  labels:
    workload.stateful: nfs-centralus-001
spec:
  type: ExternalName
  externalName: nfs-centralus-001.c.saas-workload-io.internal
  selector:
    workload.stateful: nfs-centralus-001

and the ServiceMonitor looks like:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    ops.workload.io/component: nfs-centralus-001
    ops.workload.io/category: infrastructure
  name: nfs-centralus-001
  namespace: workload
spec:
  endpoints:
  - path: /metrics
    interval: 15s
    targetPort: 9100
    scheme: http
    relabelings:
      - sourceLabels: [__address__]
        targetLabel: __address__
        regex: (.*)
        replacement: "$FQDN:$PORT"
        action: replace
  jobLabel: ops.workload.io/nfs-centralus-001
  namespaceSelector:
    matchNames:
    - workload
  selector:
    matchExpressions:
      - key: workload.stateful
        operator: In
        values: ["nfs-centralus-001"]

@alexisph

@miguel-callejas-coderoad-com, you're missing the Endpoints resource. Based on your example:

apiVersion: v1
kind: Endpoints
metadata:
  name: nfs-centralus-001
  namespace: workload
  labels:
    workload.stateful: nfs-centralus-001
subsets:
- addresses:
  - ip: 1.2.3.4
  - ip: 1.2.3.5
  ports:
  - name: metrics
    port: 9100
    protocol: TCP

@marratj commented Aug 2, 2021

A quick note on this, as I was struggling to find the same config:

You don't even need to specify the real IP of your destination FQDN in the Endpoints object; it can be any IP, because by relabeling __address__ you're instructing Prometheus to scrape whatever is specified in the __address__ label.

This label would usually be populated with the IP address defined in the Endpoints object, but we're replacing it with a completely different address here, so the actual IP in the Endpoints object no longer matters to Prometheus itself.
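
For illustration, a minimal sketch of this pattern (hypothetical names and namespace; 192.0.2.1 is a documentation placeholder IP, and a Service with the same name and label is assumed to exist):

apiVersion: v1
kind: Endpoints
metadata:
  name: external-example              # must match the Service name
  namespace: monitoring
subsets:
- addresses:
  - ip: 192.0.2.1                     # placeholder IP; never actually contacted after relabeling
  ports:
  - name: metrics
    port: 443
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-example
  namespace: monitoring
spec:
  endpoints:
  - port: metrics
    scheme: https
    relabelings:
    - sourceLabels: [__address__]
      targetLabel: __address__
      regex: (.*)
      replacement: "external.example.com:443"   # the FQDN you actually want to scrape
      action: replace
  selector:
    matchLabels:
      app: external-example           # assumed label on the matching Service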

@elsbrock

Can you give an example?

@hryamzik

@alexisph @miguel-callejas-coderoad-com do you literally use "$FQDN:$PORT" as the replacement?

It works for me if I put real values there (i.e. "myhost.example.com:8080"), but "$FQDN:$PORT" produces instance=":".

@alexisph

@hryamzik, just use the FQDN and port of the service you want to scrape, like in your example.

@hryamzik

That's what I already do; I hoped for a more elegant solution since externalName already contains it. Got it, ty!

@r0bj (Contributor) commented Oct 2, 2021

This issue became more important after k8s 1.22, in which write access to Endpoints was disabled by default in the admin roles due to CVE-2021-25740:
kubernetes/kubernetes#103675
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#write-access-for-endpoints

@paulfantom (Member)

The main solution for this would be implementing the generic ScrapeConfig CRD described in #2787. Contributions welcome.

@cuchac commented Nov 20, 2021

@hryamzik I found a better solution that does not require duplicating domains.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
  - interval: 30s
    path: /_prometheus/metrics/
    port: web
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_endpoint_node_name
      targetLabel: __address__
  selector:
     ...
---
apiVersion: v1
kind: Endpoints
metadata:
  ...
subsets:
- addresses:
  - ip: 1.2.3.5
    nodeName: www.example.com
  ports:
...

Or you can use __meta_kubernetes_service_name or __meta_kubernetes_endpoints_name in sourceLabels to get the hostname from the Service or Endpoints name. I use __meta_kubernetes_endpoint_node_name so that I can have more domains inside one Endpoints object.

@ig-matsz commented Feb 8, 2022

Hi.
Based on the previous posts we created a similar workaround. Let me share:

apiVersion: v1
kind: Service
metadata:
  name: external-dev-prometheus
  namespace: monitoring
  labels:
    k8s-app: external-dev-prometheus
spec:
  type: ExternalName
  externalName: FQDN
  ports:
  - name: metrics
    port: 443
    protocol: TCP
    targetPort: 443

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-dev-prometheus
  namespace: monitoring
  labels:
    k8s-app: external-dev-prometheus
spec:
  endpoints:
  - port: metrics
    interval: 30s
    honorLabels: true
    scheme: https
    path: /metrics
    tlsConfig:
      insecureSkipVerify: true
    relabelings:
      - sourceLabels: [__address__]
        targetLabel: __address__
        regex: (.*)
        replacement: "FQDN:443"
        action: replace
  selector:
    matchLabels:
      k8s-app: external-dev-prometheus
  namespaceSelector:
    matchNames:
    - monitoring

apiVersion: v1
kind: Endpoints
metadata:
  name: external-dev-prometheus
  namespace: monitoring
  labels:
    k8s-app: external-dev-prometheus
subsets:
- addresses:
  - ip: 1.2.3.4
  ports:
  - name: metrics
    port: 443
    protocol: TCP


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
  name: external-dev-prometheus-endpoint
  namespace: monitoring
spec:
  groups:
    - name: critical-external
      rules:
        - alert: PrometheusTargetMissing
          expr: up {endpoint="metrics",  namespace="monitoring", service="external-dev-prometheus"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            message: Prometheus target missing (instance {{ $labels.instance }})
            description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

This solved the issue and everything was perfect. Unfortunately, we noticed that after a short (a few hours) but random time the targets disappear, effectively disabling this monitoring setup.

Here's an illustration of the event:

[screenshot: graph of the target count over time, showing all custom external targets disappearing at the same moment]

We have 10 endpoints from the prometheus-operator stack, such as node exporters, Alertmanager and kube-prometheus-stack itself. We are adding 8 custom endpoints as described above. You can see that all our custom external endpoints are gone at the same time.

Our setup:
kube-prometheus-stack-13.10.0
But we also verified this on kube-prometheus-stack-31.0.1 with the same result.

What we verified:

  • all objects that I pasted are still present in the cluster
  • we weren't able to find any meaningful log messages indicating problems, for example with WALs
  • the file /etc/prometheus/config_out/prometheus.env.yaml of the prometheus pod has this endpoint config
  • reapplying those objects brings back the target, but then it is gone again after some random time

Has anyone seen anything similar? Can anyone give some hints on what else we can check? Thanks, I appreciate any feedback.

@simonpasquier (Contributor) commented Feb 15, 2022

Closing this issue in favor of #2787 (generic scrape config CRD), which should eventually resolve the original request.
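
For later readers, a rough sketch of how the original request could be expressed with the generic scrape config CRD tracked in #2787, assuming the v1alpha1 ScrapeConfig resource with static targets (names, labels, and the FQDN:port are placeholders):

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: external-service              # hypothetical name
  namespace: monitoring               # hypothetical namespace
  labels:
    app: prometheus1                  # whatever labels the Prometheus CR's scrapeConfigSelector matches
spec:
  staticConfigs:
  - targets:
    - external.example.com:8070       # placeholder FQDN:port

The Prometheus CR would also need a scrapeConfigSelector matching the labels above.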
