Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promtail: fix TargetManager.run() not exit after stop is called #5238

Merged

Conversation

littlepangdi
Copy link
Contributor

@littlepangdi littlepangdi commented Jan 26, 2022

What this PR does / why we need it:

This PR fix a possible goroutine leak in promtail:

  • FileTargetManager.run()will not exit after FileTargetManager.stop() is called.

manager.SyncCh() in run() is a read-only channel copy of manager.syncCh, and there is no close(ch) operation in manager's whole lifecycle. See code search below:
image

So it would be stuck in range ch operation.

By passing the contextrelated to TargetManager.cancelfunc , which is also passed to the manager above to control its lifecycle, the problem is fixed and tested in our own project.

Which issue(s) this PR fixes:
Fixes #5237

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@littlepangdi littlepangdi requested a review from a team as a code owner January 26, 2022 12:22
@CLAassistant
Copy link

CLAassistant commented Jan 26, 2022

CLA assistant check
All committers have signed the CLA.

@littlepangdi littlepangdi force-pushed the fix/promtail-targetmanager-not-exit branch from 132cb0d to 12b61a6 Compare January 26, 2022 12:37
@jeschkies
Copy link
Contributor

Great find! I'm wondering why it is not closed on the side of Prometheus.

@littlepangdi
Copy link
Contributor Author

littlepangdi commented Jan 26, 2022

Great find! I'm wondering why it is not closed on the side of Prometheus.

Me, too. I look into Prometheus project. Fun fact is that seems all the functions in Prometheus project that reference manager.SyncCh() will use ctx and for-select mechanism to make sure exit normally.

Place where write stuff to manager.synch also does so. For example,
image

Write and read process can be shutdown via ctx. Hence, there is actually no need to manully shutdown ch in Prometheus :)

@littlepangdi littlepangdi force-pushed the fix/promtail-targetmanager-not-exit branch from 12b61a6 to 8a16043 Compare January 27, 2022 01:52
@littlepangdi littlepangdi changed the title feat(promtail): fix TargetManager.run() not exit Promtail: fix TargetManager.run() not exit after stop is called Jan 27, 2022
@jeschkies
Copy link
Contributor

@littlepangdi I've talk to maintainers from Prometheus. Would you mind proposing a change there? I think the channel should be closed when the goroutine is done.

@littlepangdi
Copy link
Contributor Author

littlepangdi commented Jan 27, 2022

@littlepangdi I've talk to maintainers from Prometheus. Would you mind proposing a change there? I think the channel should be closed when the goroutine is done.

@jeschkies It does make sense to close on the source, I'll be glad to follow up on this issue there tomorrow.

Also, about this PR , I think following the same ctx & calcelFunc mechanism of Prometheus makes our code more rigorous and readable

  • This fallback makes sure our own goroutines end perfectly, whether the channel is closed or not.
  • Send out a message that when ctx ends ( the same ctx we send to discovery.Manager ), all goroutines need to end.

What do you think?

@jeschkies
Copy link
Contributor

@littlepangdi, that's a very good question. Is there a chance we loose a message when our goroutine finishes before the manager shuts down?

@RangerCD
Copy link
Contributor

@littlepangdi, that's a very good question. Is there a chance we loose a message when our goroutine finishes before the manager shuts down?

This reminds me that there might be another goroutine leakage in this particular scenario, if a message comes in while/after FileTargetManager is shutting down. This message could still be processed after Stop has already been executed, which means new targets and new goroutines will be created and never stop.

In other words, Stop must wait until run has returned, then it's safe to stop tm.syncers here.

@littlepangdi littlepangdi force-pushed the fix/promtail-targetmanager-not-exit branch from 8a16043 to cf54848 Compare January 28, 2022 07:38
@littlepangdi
Copy link
Contributor Author

@littlepangdi, that's a very good question. Is there a chance we loose a message when our goroutine finishes before the manager shuts down?

@jeschkies Yes, it does when shutting down our goroutine via ctx, seems there is no better way to fix this defect now. I've proposed a PR in Prometheus, hopefully it will be passed soon :)
prometheus/prometheus#10218

About another goroutine leakage proposed by @RangerCD , I just added a done channel to make it synchronous ,it 's similar to filetarget in Promtail here.

func (t *FileTarget) Stop() {
close(t.quit)
<-t.done
t.handler.Stop()
}

Or should I do this in another PR? Let me know if anything I can do here

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this is cleaner. We're not at risk of what Prometheus is doing.

LGTM on my side.

@jeschkies can you merge unless you really disagree here ?

@jeschkies jeschkies merged commit a4cfab4 into grafana:main Jan 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Promtail: TargetManager.run() not exit after stop is called
5 participants