feature: Add ingester handler for shutdown and forget tokens #6179

Merged
merged 3 commits into from
Jun 14, 2022

Conversation

Contributor

@chaudum chaudum commented May 17, 2022

What this PR does / why we need it:

This handler can be used to gracefully shut down a Loki instance and
delete the file that persists the tokens of the ingester ring.

In production environments you usually want to persist ring tokens so
that during a restart of an ingester instance, or during rollout, the
tokens from that instance are not re-distributed to other instances, but
instead kept so that the same streams end up on the same instance once
it is up and running again. For that, the tokens are written to a file
that can be specified via the `-ingester.tokens-file-path` argument.

In certain cases, however, you want to forget the tokens and
re-distribute them when shutting down an ingester instance. This was
already possible by calling `/ingester/flush_shutdown`, deleting the
tokens file, and terminating the instance. The new handler
`/ingester/shutdown_and_forget` combines these manual steps into a
single handler.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

The first commit with the vendor changes can be removed once the PR against grafana/dskit has been merged.

Checklist

  • Documentation added
  • Tests updated
  • Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.2%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.1%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.2%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.1%

@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from cc4695e to 4016250 Compare May 19, 2022 08:17
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.2%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.1%

@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from 4016250 to 18c75e2 Compare May 19, 2022 08:49
@chaudum chaudum marked this pull request as ready for review May 19, 2022 09:09
@chaudum chaudum requested review from KMiller-Grafana and a team as code owners May 19, 2022 09:09
Contributor

@sandeepsukhani sandeepsukhani left a comment

The changes look great. Just left some comments.

@@ -47,6 +47,7 @@ These endpoints are exposed by the ingester:

- [`POST /flush`](#post-flush)
- [`POST /ingester/flush_shutdown`](#post-ingesterflush_shutdown)
- [`POST /ingester/shutdown_and_forget`](#post-ingestershutdown_and_forget)
Contributor

shutdown_and_forget doesn't make it clear what it would forget. What do you think about calling it shutdown_and_forget_ring_tokens or shutdown_and_remove_ring_tokens?
We can also add a query param to flush_shutdown like /ingester/flush_shutdown?forget_ring_tokens=true.

Contributor Author

On the one hand I like the idea of a single endpoint, but on the other hand, I find it weird to have query parameters in URLs that only allow POST requests.

Contributor Author

It would make sense then to change the URL to /ingester/shutdown and make both flush and delete_tokens URL parameters.

/ingester/shutdown?flush=true
/ingester/shutdown?delete_tokens=true

(Of course keep the existing URL and deprecate it.)

What do you think @sandeepsukhani ?

Contributor

What about /ingester/shutdown/?unregister=true?

Keep in mind as well that documentation can clarify the purpose and mechanism for an API endpoint, if need be.

Contributor

> I find it weird to have query parameters in URLs that only allow POST requests

I agree it looks ugly but it helps us keep it tidy. If we keep adding different paths for various use cases, it would start getting messier.

I think delete_ring_tokens or reset_ring_tokens or something similar would be a bit self-descriptive.

Collaborator

If the endpoint only accepts POSTs, why not include these parameters in a request body? I do like the idea of a single endpoint with parameters.

// ingester service as "failed", so Loki will shut down entirely.
// The module manager logs the failure `modules.ErrStopProcess` in a special way.
if i.lifecycler.ClearTokensOnShutdown() && errs.Err() == nil {
return modules.ErrStopProcess
Contributor

Why can't we just rely on nil error? We can return internal server error from ShutdownAndForgetHandler handler when error is non-nil or am I missing something?

Contributor Author

No, because we need to be able to distinguish between stopping of the ingester service that happened through the signal handler, and stopping of the service that was triggered by the shutdown handler. In case of the former, the error will be nil as well.

Contributor

In both cases, we are returning w.WriteHeader(http.StatusNoContent) so I don't see any special handling in ShutdownAndForgetHandler for either of the cases. Am I missing something?

Contributor Author

Ah I see what you mean, but the returned error modules.ErrStopProcess is not only used by the shutdown handler, but also by the service manager in the Run() function, which determines whether to stop Loki when one of its services failed.

level.Info(util_log.Logger).Log("msg", "received stop signal via return error", "module", m, "error", service.FailureCase())

pkg/storage/stores/series/series_index_store_test.go (outdated, resolved)
@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from 18c75e2 to f2339b0 Compare May 30, 2022 11:50
@pull-request-size pull-request-size bot added size/M and removed size/L labels May 30, 2022
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0.1%

@chaudum
Contributor Author

chaudum commented May 30, 2022

@sandeepsukhani @dannykopping I opted for /ingester/shutdown and restructured the handler a bit.

pkg/loki/modules.go (resolved)
docs/sources/api/_index.md (outdated, resolved)
pkg/ingester/ingester.go (outdated, resolved)
pkg/ingester/ingester.go (outdated, resolved)
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jun 1, 2022
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from b836d0d to d4ff4d2 Compare June 2, 2022 11:45
@chaudum chaudum requested a review from dannykopping June 2, 2022 11:50
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@chaudum
Contributor Author

chaudum commented Jun 2, 2022

I'll squash commits once approved.

@dannykopping
Contributor

> I'll squash commits once approved.

No need, we squash on merge

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Collaborator

@trevorwhitney trevorwhitney left a comment

Just did a quick review, but looks good!

`/ingester/shutdown` is similar to the [`/ingester/flush_shutdown`](#post-ingesterflush_shutdown)
endpoint, but accepts three URL query parameters `flush`, `delete_ring_tokens`, and `terminate`.

**URL query parameters:**
Collaborator

Since this is a POST, I'm curious why we opt for URL params instead of a request body?

Contributor Author

@chaudum chaudum Jun 3, 2022

Good point. Query params are I guess easier to write when you execute the request using curl.

Contributor Author

Also, I think we do not follow any strict, restful HTTP API design guides in our endpoint design.

@chaudum chaudum requested a review from dannykopping June 3, 2022 07:45
Contributor

@dannykopping dannykopping left a comment

LGTM, nice work!

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from 42b9b81 to 3cea8d7 Compare June 7, 2022 11:45
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0.1%

Collaborator

@trevorwhitney trevorwhitney left a comment

LGTM!

Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
This handler replaces the deprecated /ingester/flush_shutdown handler
and can be used to gracefully shut down a Loki instance and delete the
file that persists the tokens of the ingester ring.

In production environments you usually want to persist ring tokens so
that during a restart of an ingester instance, or during rollout, the
tokens from that instance are not re-distributed to other instances, but
instead kept so that the same streams end up on the same instance once
it is up and running again. For that, the tokens are written to a file
that can be specified via the `-ingester.tokens-file-path` argument.

In certain cases, however, you want to forget the tokens and
re-distribute them when shutting down an ingester instance. This was
already possible by calling `/ingester/flush_shutdown`, deleting the
tokens file and terminating the process. The new handler
`/ingester/shutdown` combines these manual steps into a
single handler.

Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
@chaudum chaudum force-pushed the chaudum/shutdown-and-forget branch from 3cea8d7 to 095653c Compare June 14, 2022 13:07
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.4%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0.1%

@dannykopping dannykopping merged commit c12a1f4 into main Jun 14, 2022
@dannykopping dannykopping deleted the chaudum/shutdown-and-forget branch June 14, 2022 13:44
@osg-grafana osg-grafana added type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories and removed area/docs labels Oct 19, 2022