Prometheus pod cannot find prometheus.env.yaml on start #3061

Closed
Yangstatus opened this issue Mar 4, 2020 · 14 comments

Comments

@Yangstatus Yangstatus commented Mar 4, 2020

Please help me to check this error, thank you!

Below is the description of my pod

[root@k8s-master-1 alertmanager]# kubectl describe pod prometheus-prometheus-operator-158278-prometheus-0
Name:           prometheus-prometheus-operator-158278-prometheus-0
Namespace:      default
Priority:       0
Node:           k8s-node-2/172.29.1.102
Start Time:     Mon, 02 Mar 2020 16:30:42 +0800
Labels:         app=prometheus
                controller-revision-hash=prometheus-prometheus-operator-158278-prometheus-674c57cc8c
                prometheus=prometheus-operator-158278-prometheus
                statefulset.kubernetes.io/pod-name=prometheus-prometheus-operator-158278-prometheus-0
Annotations:    cni.projectcalico.org/podIP: 10.100.140.127/32
Status:         Running
IP:             10.100.140.127
IPs:
  IP:           10.100.140.127
Controlled By:  StatefulSet/prometheus-prometheus-operator-158278-prometheus
Containers:
  prometheus:
    Container ID:  docker://bbec5226b9c3610a265a5cee046f76044e5ff26149ad5079f774f2a49f46f1e1
    Image:         quay.io/prometheus/prometheus:v2.15.2
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:914525123cf76a15a6aaeac069fcb445ce8fb125113d1bc5b15854bc1e8b6353
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=10d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=http://prometheus-operator-158278-prometheus.default:9090
      --web.route-prefix=/
    State:          Running
      Started:      Mon, 02 Mar 2020 16:30:44 +0800
    Last State:     Terminated
      Reason:       Error
      Message:      caller=main.go:648 msg="Starting TSDB ..."
level=info ts=2020-03-02T08:30:43.925Z caller=web.go:506 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-02T08:30:43.928Z caller=head.go:584 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-03-02T08:30:43.929Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:663 fs_type=EXT4_SUPER_MAGIC
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:664 msg="TSDB started"
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:734 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:517 msg="Stopping scrape discovery manager..."
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:531 msg="Stopping notify discovery manager..."
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:553 msg="Stopping scrape manager..."
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:527 msg="Notify discovery manager stopped"
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:513 msg="Scrape discovery manager stopped"
level=info ts=2020-03-02T08:30:43.930Z caller=manager.go:814 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-03-02T08:30:43.930Z caller=manager.go:820 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-03-02T08:30:43.930Z caller=main.go:547 msg="Scrape manager stopped"
level=info ts=2020-03-02T08:30:43.933Z caller=notifier.go:598 component=notifier msg="Stopping notification manager..."
level=info ts=2020-03-02T08:30:43.933Z caller=main.go:718 msg="Notifier manager stopped"
level=error ts=2020-03-02T08:30:43.933Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"

      Exit Code:    1
      Started:      Mon, 02 Mar 2020 16:30:43 +0800
      Finished:     Mon, 02 Mar 2020 16:30:43 +0800
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:      http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-prometheus-operator-158278-prometheus-rulefiles-0 from prometheus-prometheus-operator-158278-prometheus-rulefiles-0 (rw)
      /etc/prometheus/secrets/etcd-certs from secret-etcd-certs (ro)
      /prometheus from prometheus-prometheus-operator-158278-prometheus-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-operator-158278-prometheus-token-57q88 (ro)

  prometheus-config-reloader:
    Container ID:  docker://b1b8e07d22b7bb07864eaad0f45707bf0426e879c31c14f821032a287871cbcf
    Image:         quay.io/coreos/prometheus-config-reloader:v0.35.0
    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:b75b9b60e6bc7a256b37c66ffef8db074983e800e0f710336d48484f55d51659
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --log-format=logfmt
      --reload-url=http://127.0.0.1:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    State:          Running
      Started:      Mon, 02 Mar 2020 16:30:44 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:
      POD_NAME:  prometheus-prometheus-operator-158278-prometheus-0 (v1:metadata.name)
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-operator-158278-prometheus-token-57q88 (ro)
  rules-configmap-reloader:
    Container ID:  docker://351374ae790979c0dfc6758f8b6ca321df47f6143dce250fbb3590606867ee85
    Image:         quay.io/coreos/configmap-reload:v0.0.1
    Image ID:      docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
    Port:          <none>
    Host Port:     <none>
    Args:
      --webhook-url=http://127.0.0.1:9090/-/reload
      --volume-dir=/etc/prometheus/rules/prometheus-prometheus-operator-158278-prometheus-rulefiles-0
    State:          Running
      Started:      Mon, 02 Mar 2020 16:30:44 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:     100m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/prometheus/rules/prometheus-prometheus-operator-158278-prometheus-rulefiles-0 from prometheus-prometheus-operator-158278-prometheus-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-operator-158278-prometheus-token-57q88 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-prometheus-operator-158278-prometheus
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-prometheus-operator-158278-prometheus-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-prometheus-operator-158278-prometheus-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-prometheus-operator-158278-prometheus-rulefiles-0
    Optional:  false
  secret-etcd-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  etcd-certs
    Optional:    false
  prometheus-prometheus-operator-158278-prometheus-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-operator-158278-prometheus-token-57q88:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-operator-158278-prometheus-token-57q88
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Author

@Yangstatus Yangstatus commented Mar 4, 2020

I am just beginning to learn. Please give me your advice.

@Yangstatus Yangstatus closed this Mar 5, 2020

@meschbach meschbach commented Apr 27, 2020

I am seeing a similar issue with an installation of prometheus-operator. Did you find a solution to your issue?

Author

@Yangstatus Yangstatus commented Apr 27, 2020

> I am seeing a similar issue with an installation of prometheus-operator. Did you find a solution to your issue?

This issue has not been resolved, but it does not affect normal use of the service. It may be a configuration bug, and I hope it can be fixed later.

Contributor

@brancz brancz commented Apr 27, 2020

This is a race between the file being provisioned on disk by the sidecar and Prometheus starting. However, Prometheus just tries again, and once the file is there it starts just fine. This is primarily a cosmetic flaw; things still work as expected. If someone wants to work on removing this, though, we'd be more than happy to review! :)
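
To make the race described above concrete, here is a toy reproduction in plain shell, with no Kubernetes involved. It is only a sketch under assumptions: `start_main` stands in for the Prometheus container, the background subshell for the config-reloader sidecar, and the `until` loop for the kubelet restarting the failed container. All names, paths, and delays are hypothetical.

```shell
#!/bin/sh
# Toy reproduction of the startup race. Every name here is a hypothetical
# stand-in; nothing below talks to Kubernetes or Prometheus.

workdir=$(mktemp -d)
config="$workdir/prometheus.env.yaml"

# Stand-in for the Prometheus container: fails if the config is missing.
start_main() {
    if [ ! -f "$config" ]; then
        echo "open $config: no such file or directory" >&2
        return 1
    fi
    echo "config loaded"
}

# Stand-in for the sidecar: writes the rendered file after a short delay.
( sleep 1; echo "global: {}" > "$config" ) &

# Stand-in for the restart loop: start again until the main process succeeds.
restarts=0
until start_main 2>/dev/null; do
    restarts=$((restarts + 1))
    sleep 1
done

wait
echo "restarts=$restarts"
rm -rf "$workdir"
```

The first start always loses the race here, so the counter ends at 1 or more, which mirrors the `Restart Count: 1` and the `Ready: True` in the pod description above.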

@meschbach meschbach commented Apr 27, 2020

Thanks @brancz ! Definitely wasn't reading the pod state correctly and the service did start properly after the sidecar generated the configuration file.

First thought off the top of my head is to exponentially back off for up to 30 seconds if the file doesn't exist, possibly with an additional command line option. Would this be an acceptable solution or is there a better alternative?

Willing to try to code it up.
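
A minimal sketch of the backoff idea proposed above, written as a POSIX shell function rather than as a change to Prometheus itself. The function name `wait_for_file`, its arguments, and the doubling schedule are assumptions made for illustration; the 30 second default comes from the proposal.

```shell
#!/bin/sh
# wait_for_file PATH [MAX_TOTAL_SECONDS]
# Hypothetical helper: polls for PATH, doubling the sleep after every miss,
# and gives up once the time slept reaches MAX_TOTAL_SECONDS (default 30).
wait_for_file() {
    wf_path=$1
    wf_budget=${2:-30}
    wf_delay=1
    wf_waited=0
    while [ ! -f "$wf_path" ]; do
        if [ "$wf_waited" -ge "$wf_budget" ]; then
            echo "gave up waiting for $wf_path after ${wf_waited}s" >&2
            return 1
        fi
        # Never sleep past the remaining budget.
        wf_left=$((wf_budget - wf_waited))
        if [ "$wf_delay" -gt "$wf_left" ]; then
            wf_delay=$wf_left
        fi
        sleep "$wf_delay"
        wf_waited=$((wf_waited + wf_delay))
        wf_delay=$((wf_delay * 2))
    done
    return 0
}
```

With the default budget the sleeps are 1, 2, 4, 8 and then 15 seconds. A wrapper could call `wait_for_file /etc/prometheus/config_out/prometheus.env.yaml 30` and only start the main process if it returns 0.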

Author

@Yangstatus Yangstatus commented Apr 28, 2020

> Thanks @brancz ! Definitely wasn't reading the pod state correctly and the service did start properly after the sidecar generated the configuration file.
>
> First thought off the top of my head is to exponentially back off for up to 30 seconds if the file doesn't exist, possibly with an additional command line option. Would this be an acceptable solution or is there a better alternative?
>
> Willing to try to code it up.

That's a good idea!

Contributor

@brancz brancz commented Apr 29, 2020

If we can find something reasonably small, then I feel the best option would be something minimal that just checks that the file is there and then starts the Prometheus process.

@meschbach meschbach commented Apr 29, 2020

What would happen if the file isn't present?

If we change the entry point to a script then something like this might work on busybox:

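#!/bin/sh
# Block until the sidecar has written the rendered configuration file.
while [ ! -f /etc/prometheus/config_out/prometheus.env.yaml ]; do
    sleep 1
done
# Replace the shell with the real process, passing all arguments through.
exec operator "$@"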

Is modifying the entry point a stability concern?

Contributor

@elisiano elisiano commented Apr 29, 2020

Another approach would be to have an init container that generates that file the first time, so that when the operator starts the file is there for sure.
This means that the file should be placed on a shared volume, though.

@meschbach meschbach commented Apr 29, 2020

I think that is a great solution! I believe the prometheus instance and the configuration reloader share a volume. Do you think it would be an issue to use this volume during initialization?

Contributor

@brancz brancz commented Apr 30, 2020

Modifying the config reloader to be a one-off instead of long running for the init container sounds like a fantastic idea!

Contributor

@elisiano elisiano commented Apr 30, 2020

> Modifying the config reloader to be a one-off instead of long running for the init container sounds like a fantastic idea!

Yeah, before leaving my previous comment I went looking for that source code, but my Go is weak and I wouldn't know how to make that happen without directions.

Contributor

@brancz brancz commented May 4, 2020

The reloader is here: https://github.com/coreos/prometheus-operator/blob/master/cmd/prometheus-config-reloader/main.go

And I would suggest we add a new flag, something along the lines of --one-off, which just runs the processing once instead of watching. It requires a bit of refactoring in the Thanos reloader code to allow extracting and using only the templating part separately from the file-watch reloader; the two are currently coupled as far as I can tell. Not super difficult, it just needs some work to be put in.
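
To illustrate what a one-off pass would amount to, here is a rough shell sketch. It is not the reloader's code, which is written in Go, and the `--one-off` flag is only a suggestion at this point in the thread. The function name `render_once` and the `$(POD_NAME)` placeholder syntax are assumptions; the gzip input and the rendered output correspond to the `--config-file` and `--config-envsubst-file` arguments shown in the pod description.

```shell
#!/bin/sh
# render_once SRC_GZ DEST
# Hypothetical one-shot render: decompress SRC_GZ, replace the literal text
# $(POD_NAME) with the value of the POD_NAME environment variable, write
# DEST, and return. No file watching is involved.
render_once() {
    ro_src=$1
    ro_dest=$2
    ro_raw="$ro_dest.raw.$$"
    ro_new="$ro_dest.new.$$"
    # Pod names are DNS labels, so they cannot contain the sed delimiter "|".
    if gunzip -c "$ro_src" > "$ro_raw" &&
       sed 's|\$(POD_NAME)|'"${POD_NAME:-}"'|g' "$ro_raw" > "$ro_new"; then
        rm -f "$ro_raw"
        # Rename into place so a reader never sees a half-written file.
        mv "$ro_new" "$ro_dest"
    else
        rm -f "$ro_raw" "$ro_new"
        return 1
    fi
}
```

Run from an init container that shares the config-out volume, a call such as `render_once /etc/prometheus/config/prometheus.yaml.gz /etc/prometheus/config_out/prometheus.env.yaml` would leave the file in place before the main container starts.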

@stale stale bot commented Jul 3, 2020

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

@paulfantom paulfantom changed the title on Jun 7, 2021
Old title: level=error ts=2020-03-02T08:30:43.933Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"
New title: Prometheus pod cannot find prometheus.env.yaml on start
@paulfantom paulfantom removed the stale label Jun 7, 2021