fix(inputs.prometheus): correctly track deleted pods #12522
Conversation
Hi,
Can you please file an issue that includes an example of what a user might see or not see in logs when this situation occurs? I'd like to have the history and additional background for the fix.
I also want to ensure that the metrics are not changing due to this, only the internal handling.
Some questions, primarily to understand the changes, are inline.
Thanks!
p.Log.Errorf("splitting key into namespace and name %s\n", err.Error())
newPod, ok := newObj.(*corev1.Pod)
if !ok {
    p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
I realize in the current situation we ignore any error, but if this error should start happening for someone, is there anything we can print to point the user or us to a root cause?
this error shouldn't happen in practice: we create the Informer for Pods, so the objects we receive in the Add/Update/Delete handlers will be Pod objects, and this cast will never fail. It could only start happening if a future refactor of this code changes something.
I guess I am wondering what to do if we tell someone to file a bug and the only response I have is "that shouldn't happen."
I'd rather this say something like "failed to convert to a pod" and print the string representation of the interface, or something similar, to give a hint at what was being converted.
Does that make sense?
Whether it will be triggered or not depends entirely on the code, not on the environment the user runs Telegraf in. I figured that if they report seeing this log line in an issue, it will be trivial for devs to figure out what exactly is being passed here.
We could also remove the check and convert directly with:
newPod := newObj.(*corev1.Pod)
which will panic should newObj be of the wrong type, and the panic will have all the necessary details printed.
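For reference, a minimal self-contained sketch of the unchecked-assertion behaviour mentioned here; `Pod` is a local stand-in for corev1.Pod so the snippet runs without client-go:

```go
package main

import "fmt"

// Pod is a local stand-in for corev1.Pod; the real plugin asserts
// against the Kubernetes type instead.
type Pod struct{ Name string }

func main() {
	var newObj interface{} = 42 // deliberately the wrong type
	defer func() {
		if r := recover(); r != nil {
			// The runtime panic names both the actual and expected types,
			// e.g. "interface conversion: interface {} is int, not *main.Pod".
			fmt.Println("recovered:", r)
		}
	}()
	pod := newObj.(*Pod) // unchecked assertion: panics on wrong type
	fmt.Println(pod.Name)
}
```

This is why the panic "has all necessary details": Go's runtime includes both the dynamic and the asserted type in the panic message.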
> it will be trivial for devs to figure out what exactly is being passed here.

I would hopefully agree, but given how hard it is to get logs from some users, I'd rather we give the user as much info as possible without just telling them to file an issue.

> which will panic should newObj be of the wrong type

While that is an option, I'd prefer we not panic anywhere in plugins.
I think all I am after is an updated error message, like "did not receive a pod", that prints the interface should something ever go wrong.
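A sketch of the kind of message being suggested; `describeUnexpected` and the local `Pod` type are illustrative stand-ins, not names from the plugin:

```go
package main

import "fmt"

// Pod is a local stand-in for corev1.Pod so the sketch is self-contained.
type Pod struct{ Name string }

// describeUnexpected builds an error message that includes the dynamic
// type (%T) and value (%v) of the object, hinting at what failed to
// convert instead of only saying "[BUG] Not a Pod".
func describeUnexpected(obj interface{}) string {
	return fmt.Sprintf("receiving delete event: expected a pod, got %T (%v)", obj, obj)
}

func main() {
	var newObj interface{} = "not-a-pod"
	newPod, ok := newObj.(*Pod)
	if !ok {
		fmt.Println(describeUnexpected(newObj))
		return
	}
	fmt.Println("pod:", newPod.Name)
}
```

Logging `%T` gives maintainers the concrete type without requiring the user to do anything beyond pasting the log line.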
UpdateFunc: func(_, newObj interface{}) {
    newPod, ok := newObj.(*corev1.Pod)
    if !ok {
        p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
Same question here re: printing something
I created the issue and linked this PR to it
What do you mean?
It appears all of these changes are only to the internal tracking of pods, but I wanted to ensure that no tags or field values are changed due to this PR.
that's correct, there shouldn't be any changes like that
thanks!
Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip.
Thanks again for looking into this
(cherry picked from commit 51f23d2)
Deleted pods were not always unregistered. This happened when Telegraf was briefly disconnected from the Kubernetes API server: in that case the Informer's OnDelete handler doesn't just receive a corev1.Pod object, but a special marker indicating that the pod was deleted while its last known state might not be up to date. This PR accounts for that behaviour.
It also reworks how the Informer is used to track pods.
Fixes #12527
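A minimal sketch of the tombstone-handling pattern described above; `Pod` and `DeletedFinalStateUnknown` here are local stand-ins mirroring the shapes of corev1.Pod and client-go's cache.DeletedFinalStateUnknown, so the snippet runs without Kubernetes dependencies:

```go
package main

import "fmt"

// Pod is a local stand-in for corev1.Pod.
type Pod struct{ Name string }

// DeletedFinalStateUnknown mirrors client-go's tombstone type: it wraps
// the last known state of an object whose deletion was observed while
// the watch connection was interrupted.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// podFromDeleteEvent unwraps the tombstone if present, so deleted pods
// are unregistered even after a brief API-server disconnect.
func podFromDeleteEvent(obj interface{}) (*Pod, bool) {
	if pod, ok := obj.(*Pod); ok {
		return pod, true
	}
	if tombstone, ok := obj.(DeletedFinalStateUnknown); ok {
		pod, ok := tombstone.Obj.(*Pod)
		return pod, ok
	}
	return nil, false
}

func main() {
	direct := &Pod{Name: "a"}
	wrapped := DeletedFinalStateUnknown{Key: "ns/b", Obj: &Pod{Name: "b"}}
	if p, ok := podFromDeleteEvent(direct); ok {
		fmt.Println("unregister", p.Name)
	}
	if p, ok := podFromDeleteEvent(wrapped); ok {
		fmt.Println("unregister", p.Name)
	}
}
```

The key point is that a DeleteFunc handler must check for both shapes; handling only the plain pod case is what caused deleted pods to stay registered after a disconnect.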