
fix(inputs.prometheus): correctly track deleted pods #12522

Merged
4 commits merged into influxdata:master from the kubernetes-lost-pods branch on Jan 23, 2023

Conversation


@redbaron redbaron commented Jan 19, 2023


Deleted pods were not always unregistered. This happened when Telegraf was briefly disconnected from the Kubernetes API server: in that case the Informer's OnDelete handler doesn't just return a corev1.Pod object, but a special marker indicating that the pod was deleted while its last known state might not be up to date. This PR accounts for that behaviour.

It also reworks how the Informer is used to track pods; in summary:

  • registering/unregistering a pod now tracks each pod by its ID "$namespace/$podName"
  • the state of the pod passed to the Informer handler funcs is used as-is, to simplify the tracking logic

Fixes #12527
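For illustration, a minimal sketch of the DeleteFunc behaviour described above, assuming the usual corev1 and client-go tools/cache imports; the unregisterPod helper and its "$namespace/$podName" key are placeholders for the plugin's own tracking logic, not the exact code in this PR:

DeleteFunc: func(obj interface{}) {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        // The delete event was missed (e.g. a brief API server disconnect),
        // so client-go wraps the last known state in a tombstone marker.
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            p.Log.Errorf("received unexpected object in delete handler: %T", obj)
            return
        }
        pod, ok = tombstone.Obj.(*corev1.Pod)
        if !ok {
            p.Log.Errorf("tombstone contained unexpected object: %T", tombstone.Obj)
            return
        }
    }
    // Unregister by ID so the same pod registered earlier is removed here.
    unregisterPod(pod.GetNamespace()+"/"+pod.GetName(), p)
},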

@telegraf-tiger bot added the area/prometheus, fix, and plugin/input labels Jan 19, 2023
@redbaron redbaron force-pushed the kubernetes-lost-pods branch 2 times, most recently from d027d5d to 5a681f5, January 19, 2023 14:23
@powersj powersj left a comment

Hi,

Can you please file an issue that includes an example of what a user might see or not see in logs when this situation occurs? I'd like to have the history and additional background for the fix.

I also want to ensure that the metrics are not changing due to this, only the internal handling.

Some questions, primarily to understand the changes, are inline.

Thanks!

plugins/inputs/prometheus/kubernetes.go
p.Log.Errorf("splitting key into namespace and name %s\n", err.Error())
newPod, ok := newObj.(*corev1.Pod)
if !ok {
p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
Contributor:

I realize in the current situation we ignore any error, but if this error should start happening for someone, is there anything we can print to point the user or us to a root cause?

Contributor Author:

This error shouldn't happen in practice: we create the Informer for Pods, so it is Pod objects which we receive in the Add/Update/Delete handlers, and this cast will never fail. It could only start happening if a future refactor of this code changes something.
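For reference, the usual client-go wiring looks roughly like the sketch below (client, namespace, and the handler funcs are placeholder names, not necessarily the plugin's exact code); because the informer is created specifically for the Pod resource, the handlers only ever see Pod objects:

factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
    informers.WithNamespace(namespace))
podInformer := factory.Core().V1().Pods().Informer()
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    onAdd,    // always receives *corev1.Pod
    UpdateFunc: onUpdate, // always receives *corev1.Pod (old, new)
    DeleteFunc: onDelete, // *corev1.Pod or cache.DeletedFinalStateUnknown
})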

Contributor:

I guess I am wondering what to do if we tell someone to file a bug and the only response I have is "that shouldn't happen."

I'd rather this say something like "failed to convert to a pod" and print the string representation of the interface or something to give a hint at what was trying to be converted.

Does that make sense?

Contributor Author:

Whether it will be triggered or not depends entirely on the code, not on the environment the user runs Telegraf in. I figured that if they report seeing this log line in the issue, it will be trivial for devs to figure out what exactly is being passed here.

We can remove the check and convert just with:

newPod := newObj.(*corev1.Pod)

which will panic should newObj be of the wrong type, and it will have all the necessary details printed.

@powersj powersj Jan 20, 2023

it will be trivial for devs to figure out what exactly is being passed here.

I would hopefully agree, but given how hard it is to get logs from some users I'd rather we give the user as much info as possible without just telling them to file an issue.

which will panic should newObj

While that is an option, I'd prefer we not panic anywhere in plugins.

I think all I am after is an updated error message, like "did not receive a pod", and printing the interface should something ever go wrong.
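Something along these lines, purely as a sketch of the suggestion (the exact wording and format verbs are illustrative, not the change ultimately merged):

newPod, ok := newObj.(*corev1.Pod)
if !ok {
    // Include the concrete type and value so a bug report carries a hint at the root cause.
    p.Log.Errorf("[BUG] did not receive a pod, got %T: %v", newObj, newObj)
    return
}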

UpdateFunc: func(_, newObj interface{}) {
newPod, ok := newObj.(*corev1.Pod)
if !ok {
p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
Contributor:

Same question here re: printing something

plugins/inputs/prometheus/kubernetes.go
@redbaron (Contributor Author)

I created the issue and linked this PR to it

I also want to ensure that the metrics are not changing due to this, only the internal handling.

What do you mean?

powersj commented Jan 20, 2023

I also want to ensure that the metrics are not changing due to this, only the internal handling.

What do you mean?

It appears all of these changes are only to the internal tracking of pods, but I wanted to ensure that no tags or field values are changed due to this PR.

@redbaron (Contributor Author)

but I wanted to ensure that no tags or field values are changed due to this PR.

that's correct, there shouldn't be any changes like that

@powersj powersj left a comment

thanks!

@powersj powersj added the "ready for final review" label Jan 20, 2023

redbaron commented Jan 23, 2023

The problem reappeared even with this fix applied, please hold off on merging it. Pushed a fix.

@srebhan srebhan requested a review from powersj January 23, 2023 14:18
@srebhan srebhan left a comment

Looks good to me. Thanks for fixing this issue @redbaron!

@powersj can you please give this a second look after the additional fix?!?!

@srebhan srebhan assigned powersj and unassigned srebhan Jan 23, 2023
@powersj powersj left a comment

Thanks again for looking into this

@powersj powersj merged commit 51f23d2 into influxdata:master Jan 23, 2023
@redbaron redbaron deleted the kubernetes-lost-pods branch January 23, 2023 16:19
srebhan pushed a commit that referenced this pull request Jan 30, 2023