
fix(inputs.prometheus): correctly track deleted pods #12522

Merged
4 commits merged into influxdata:master from the kubernetes-lost-pods branch on Jan 23, 2023

Conversation


@redbaron redbaron commented Jan 19, 2023


Deleted pods were not always unregistered. This happened when Telegraf was briefly disconnected from the Kubernetes API server: in that case the Informer's OnDelete handler doesn't just return a corev1.Pod object, but a special marker indicating that the pod was deleted while its last known state might not be up to date. This PR accounts for that behaviour.

It also reworks how the Informer is used to track pods; in summary:

  • registering/unregistering a pod now tracks each pod by its ID "$namespace/$podName"
  • the state of the pod passed to the Informer handler funcs is used as-is, to simplify the tracking logic

Fixes #12527
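For illustration, a minimal sketch of the DeleteFunc behaviour described above, assuming the usual corev1 and client-go tools/cache imports; the unregisterPod helper and its "$namespace/$podName" key are placeholders for the plugin's own tracking logic, not the exact code in this PR:

DeleteFunc: func(obj interface{}) {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        // The delete event was missed (e.g. a brief API server disconnect),
        // so client-go wraps the last known state in a tombstone marker.
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            p.Log.Errorf("received unexpected object in delete handler: %T", obj)
            return
        }
        pod, ok = tombstone.Obj.(*corev1.Pod)
        if !ok {
            p.Log.Errorf("tombstone contained unexpected object: %T", tombstone.Obj)
            return
        }
    }
    // Unregister by ID so the same pod registered earlier is removed here.
    unregisterPod(pod.GetNamespace()+"/"+pod.GetName(), p)
},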

@telegraf-tiger bot added the area/prometheus, fix, and plugin/input labels Jan 19, 2023
@redbaron redbaron force-pushed the kubernetes-lost-pods branch 2 times, most recently from d027d5d to 5a681f5, January 19, 2023 14:23
@powersj powersj left a comment

Hi,

Can you please file an issue that includes an example of what a user might see or not see in logs when this situation occurs? I'd like to have the history and additional background for the fix.

I also want to ensure that the metrics are not changing due to this, only the internal handling.

Some questions, primarily to understand the changes, are inline.

Thanks!

plugins/inputs/prometheus/kubernetes.go
p.Log.Errorf("splitting key into namespace and name %s\n", err.Error())
newPod, ok := newObj.(*corev1.Pod)
if !ok {
p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
Contributor:

I realize in the current situation we ignore any error, but if this error should start happening for someone, is there anything we can print to point the user or us to a root cause?

Contributor Author:

This error shouldn't happen in practice: we create the Informer for Pods, so it is Pod objects which we receive in the Add/Update/Delete handlers, and this cast will never fail. It could only start happening if a future refactor of this code changes something.
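For reference, the usual client-go wiring looks roughly like the sketch below (client, namespace, and the handler funcs are placeholder names, not necessarily the plugin's exact code); because the informer is created specifically for the Pod resource, the handlers only ever see Pod objects:

factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
    informers.WithNamespace(namespace))
podInformer := factory.Core().V1().Pods().Informer()
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    onAdd,    // always receives *corev1.Pod
    UpdateFunc: onUpdate, // always receives *corev1.Pod (old, new)
    DeleteFunc: onDelete, // *corev1.Pod or cache.DeletedFinalStateUnknown
})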

Contributor:

I guess I am wondering what to do if we tell someone to file a bug and the only response I have is "that shouldn't happen."

I'd rather this say something like "failed to convert to a pod" and print the string representation of the interface or something to give a hint at what was trying to be converted.

Does that make sense?

Contributor Author:

Whether it will be triggered or not depends entirely on the code, not on the environment the user runs Telegraf in. I figured that if they report seeing this log line in the issue, it will be trivial for devs to figure out what exactly is being passed here.

We can remove the check and convert just with:

newPod := newObj.(*corev1.Pod)

which will panic should newObj be of the wrong type, and it will have all the necessary details printed.

@powersj powersj Jan 20, 2023

it will be trivial for devs to figure out what exactly is being passed here.

I would hopefully agree, but given how hard it is to get logs from some users I'd rather we give the user as much info as possible without just telling them to file an issue.

which will panic should newObj

While that is an option, I'd prefer we not panic anywhere in plugins.

I think all I am after is an updated error message, like "did not receive a pod", and printing the interface should something ever go wrong.
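Something along these lines, purely as a sketch of the suggestion (the exact wording and format verbs are illustrative, not the change ultimately merged):

newPod, ok := newObj.(*corev1.Pod)
if !ok {
    // Include the concrete type and value so a bug report carries a hint at the root cause.
    p.Log.Errorf("[BUG] did not receive a pod, got %T: %v", newObj, newObj)
    return
}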

UpdateFunc: func(_, newObj interface{}) {
newPod, ok := newObj.(*corev1.Pod)
if !ok {
p.Log.Error("[BUG] Not a Pod, report it to the Github issues, please")
Contributor:

Same question here re: printing something

plugins/inputs/prometheus/kubernetes.go
@redbaron (Contributor Author)

I created the issue and linked this PR to it

I also want to ensure that the metrics are not changing due to this, only the internal handling.

What do you mean?

powersj commented Jan 20, 2023

I also want to ensure that the metrics are not changing due to this, only the internal handling.

What do you mean?

It appears all of these changes are only to the internal tracking of pods, but I wanted to ensure that no tags or field values are changed due to this PR.

@redbaron (Contributor Author)

but I wanted to ensure that no tags or field values are changed due to this PR.

that's correct, there shouldn't be any changes like that

@powersj powersj left a comment

thanks!

@powersj powersj added the "ready for final review" label Jan 20, 2023

redbaron commented Jan 23, 2023

The problem reappeared even with this fix applied, please hold off on merging it. Pushed a fix.

@srebhan srebhan requested a review from powersj January 23, 2023 14:18
@srebhan srebhan left a comment

Looks good to me. Thanks for fixing this issue @redbaron!

@powersj can you please give this a second look after the additional fix?!?!

@srebhan srebhan assigned powersj and unassigned srebhan Jan 23, 2023
@powersj powersj left a comment

Thanks again for looking into this

@powersj powersj merged commit 51f23d2 into influxdata:master Jan 23, 2023
@redbaron redbaron deleted the kubernetes-lost-pods branch January 23, 2023 16:19
srebhan pushed a commit that referenced this pull request Jan 30, 2023